
OSWorld benchmark #255


Open · wants to merge 56 commits into base: main

Conversation


@ollmer ollmer commented Jun 24, 2025

TODO:

  • fill up the setup.md
  • write basic tests like the ones we have for the Gaia bench in test_gaia_agent.py
  • settle the remaining TODOs in the code

Description by Korbit AI

What change is being made?

Add the OSWorld benchmark integration, including support for task management and execution environment setup, to allow AgentLab to process and evaluate tasks within the OSWorld framework.

Why are these changes being made?

These changes introduce the OSWorld benchmark integration, enabling testing and benchmarking of AgentLab's capabilities within an operating system environment. This allows more comprehensive task evaluation, leveraging OSWorld's rich set of desktop task scenarios to exercise AgentLab's automation and interaction abilities. It also establishes a standardized method for setting up and running these evaluations on virtual machines, bringing consistency and robustness to the testing process.

Is this description stale? Ask me to generate a new description by commenting /korbit-generate-pr-description


korbit-ai bot commented Jun 24, 2025

Based on your review schedule, I'll hold off on reviewing this PR until it's marked as ready for review. If you'd like me to take a look now, comment /korbit-review.

Your admin can change your review schedule in the Korbit Console

@git clone https://github.com/xlang-ai/OSWorld || true
@echo "Modifying OSWorld requirements.txt to remove pinned versions..."
@cd OSWorld && \
sed -i.bak 's/numpy~=.*/numpy/' requirements.txt && \
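The Makefile recipe above strips the version pin from numpy so OSWorld's requirements can coexist with AgentLab's. A minimal Python sketch of the same unpinning step (the file contents here are illustrative, not OSWorld's actual requirements):

```python
import re

# Sketch: strip the "~=" pin from numpy in a requirements file, mirroring
# the Makefile's sed step. The contents below are illustrative only.
reqs = "numpy~=1.24.0\nrequests~=2.31.0\n"
unpinned = re.sub(r"^numpy~=.*$", "numpy", reqs, flags=re.M)
print(unpinned)
```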
Collaborator Author

neat trick!

@ollmer ollmer changed the title [WIP] OSWorld benchmark OSWorld benchmark Jul 10, 2025
@ollmer ollmer requested a review from recursix July 10, 2025 17:10
@ollmer
Collaborator Author

ollmer commented Jul 10, 2025

The code is good enough to run evaluations now. We intend to make it a little cleaner and add a README describing how to set up your VMs to run the eval, but otherwise it's ready to review.

@ollmer ollmer marked this pull request as ready for review July 10, 2025 17:13
"source.organizeImports": "explicit",
"source.fixAll": "never"
}
"source.organizeImports": "always",
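For context, the hunk above changes the workspace's on-save code actions from "always" to "explicit". A complete settings fragment with the new values might look like this (the surrounding `editor.codeActionsOnSave` key is assumed from VS Code's settings schema):

```json
{
  "editor.codeActionsOnSave": {
    "source.organizeImports": "explicit",
    "source.fixAll": "never"
  }
}
```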
Collaborator Author

This was my opinionated update; let me know if you're not comfortable with it and we can roll it back.

@@ -1333,7 +1333,7 @@ def plot_profiling(ax, step_info_list: list[StepInfo], summary_info: dict, progr
horizontalalignment="right",
rotation=0,
clip_on=True,
antialiased=True,
# antialiased=True,
Collaborator Author

what's this?

Collaborator

matplotlib==3.7.5 does not support this.

@@ -405,7 +414,8 @@ def __init__(

def obs_preprocessor(self, obs):
obs = copy(obs)

if self.config.obs.use_osworld_obs_preprocessor:
return self.osworld_obs_preprocessor(obs)
Collaborator Author

Kinda hacky solution. I think it's better to do all the preprocessing right inside the OSWorld gym's .step() method and just introduce a universal flag, skip_preprocessing, here in the agent, which set_benchmark() would set to True when the benchmark is osworld.
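A minimal sketch of this proposed alternative (the names `skip_preprocessing`, `ObsConfig`, and the config shape are hypothetical illustrations, not the PR's actual API):

```python
from dataclasses import dataclass, field


@dataclass
class ObsConfig:
    skip_preprocessing: bool = False  # hypothetical universal flag


@dataclass
class Config:
    obs: ObsConfig = field(default_factory=ObsConfig)


class Agent:
    def __init__(self):
        self.config = Config()

    def set_benchmark(self, benchmark_name: str):
        # Benchmarks that preprocess inside their own gym .step()
        # tell the agent to leave observations untouched.
        if benchmark_name == "osworld":
            self.config.obs.skip_preprocessing = True

    def obs_preprocessor(self, obs):
        if self.config.obs.skip_preprocessing:
            return obs  # the gym already did the work in .step()
        return {**obs, "preprocessed": True}


agent = Agent()
agent.set_benchmark("osworld")
print(agent.obs_preprocessor({"screenshot": b""}))
```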

Collaborator

Good idea!

action_set=OSWorldActionSet("computer_13"), # or "pyautogui"
)

OSWORLD_OAI = ToolUseAgentArgs(
Collaborator Author

Let's run an eval with this config at least once; I remember having issues trying to do that.

Collaborator

Done! Performs poorly though.


@korbit-ai korbit-ai bot left a comment


Review by Korbit AI

Korbit automatically attempts to detect when you fix issues in new commits.
  • Error Handling: Missing Error Context in Timing Decorator
  • Functionality: Incomplete Benchmark Configuration
  • Documentation: Clarify Token Limit Parameter
  • Performance: Inefficient List Filtering with Set Membership
  • Logging: Print Statement Instead of Logger ✅ Fix detected
  • Security: Unsafe File Path Handling ✅ Fix detected
  • Design: Inconsistent Configuration Management ✅ Fix detected
  • Readability: Contradictory Comment vs Code Configuration ✅ Fix detected
  • Functionality: Empty OSWorld Preprocessor ✅ Fix detected
  • Security: Unsafe XML Parsing
Files scanned (all reviewed):
experiments/run_osworld.py
src/agentlab/benchmarks/abstract_env.py
src/agentlab/benchmarks/osworld_axtree_preprocessing.py
src/agentlab/agents/tool_use_agent/tool_use_agent.py
src/agentlab/llm/response_api.py
src/agentlab/analyze/inspect_results.py
src/agentlab/experiments/study.py
src/agentlab/analyze/agent_xray.py
src/agentlab/benchmarks/osworld.py


@wraps(step_func)
def wrapped_step(self, action: str):
action_exec_start = time.time()
obs, reward, terminated, truncated, env_info = step_func(self, action)

Missing Error Context in Timing Decorator (category: Error Handling)

What is the issue?

The step_func call in the timing decorator lacks error handling, potentially losing timing context if an exception occurs.

Why this matters

If step_func raises an exception, the timing information and context would be lost, making it harder to debug performance issues or failures.

Suggested change
    @wraps(step_func)
    def wrapped_step(self, action: str):
        action_exec_start = time.time()
        try:
            obs, reward, terminated, truncated, env_info = step_func(self, action)
            action_exec_stop = time.time()

            # Ensure env_info is a dictionary
            if env_info is None:
                env_info = {}

            if "action_exec_start" not in env_info:
                env_info["action_exec_start"] = action_exec_start
            if "action_exec_stop" not in env_info:
                env_info["action_exec_stop"] = action_exec_stop
            if "action_exec_timeout" not in env_info:
                env_info["action_exec_timeout"] = 0.0

            return obs, reward, terminated, truncated, env_info
        except Exception as e:
            action_exec_stop = time.time()
            # Re-raise with timing context
            raise type(e)(f"Error during step (duration: {action_exec_stop - action_exec_start:.3f}s): {str(e)}") from e
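To make the timing-decorator pattern concrete, here is a self-contained usage sketch with a stand-in environment (the `DummyEnv` class and the simplified no-exception path are illustrative, not the PR's actual code):

```python
import time
from functools import wraps


def add_step_timing(step_func):
    """Wrap an env step to record wall-clock timing in env_info."""
    @wraps(step_func)
    def wrapped_step(self, action: str):
        action_exec_start = time.time()
        obs, reward, terminated, truncated, env_info = step_func(self, action)
        env_info = env_info or {}  # tolerate steps that return None info
        env_info.setdefault("action_exec_start", action_exec_start)
        env_info.setdefault("action_exec_stop", time.time())
        env_info.setdefault("action_exec_timeout", 0.0)
        return obs, reward, terminated, truncated, env_info
    return wrapped_step


class DummyEnv:
    @add_step_timing
    def step(self, action: str):
        return {"obs": action}, 0.0, False, False, None


_, _, _, _, info = DummyEnv().step("noop")
print(sorted(info))
```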



def get_task_ids() -> set[str]:
with open("experiments/osworld_debug_task_ids.json", "r") as f:

This comment was marked as resolved.

if os.environ.get("AGENTLAB_DEBUG"):
task_ids = get_task_ids()
study.exp_args_list = [exp_args for exp_args in study.exp_args_list if exp_args.env_args.task["id"] in task_ids] # type: ignore
print(f"Debug on {len(study.exp_args_list)} experiments")

This comment was marked as resolved.

Comment on lines +31 to +32
task_ids = get_task_ids()
study.exp_args_list = [exp_args for exp_args in study.exp_args_list if exp_args.env_args.task["id"] in task_ids] # type: ignore

Inefficient List Filtering with Set Membership (category: Performance)

What is the issue?

The task IDs are loaded from file and filtered using a list comprehension, which creates an unnecessary intermediate list and performs a membership test against a set for each item.

Why this matters

For large experiment lists, this creates memory overhead from the intermediate list and has O(n) complexity for each membership test against the task_ids set.

Suggested change

Use filter() with a lambda or generator expression to avoid creating intermediate lists:

study.exp_args_list = list(filter(lambda x: x.env_args.task["id"] in task_ids, study.exp_args_list))
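As a sanity check, the comprehension and the `filter()` form select the same experiments; a sketch with stand-in dataclasses (`ExpArgs`/`EnvArgs` here are simplified stand-ins for the real agentlab classes):

```python
from dataclasses import dataclass


@dataclass
class EnvArgs:
    task: dict


@dataclass
class ExpArgs:
    env_args: EnvArgs


exp_args_list = [ExpArgs(EnvArgs({"id": i})) for i in ("a", "b", "c")]
task_ids = {"a", "c"}

# Both styles keep exactly the experiments whose task id is in the set.
kept_comprehension = [e for e in exp_args_list if e.env_args.task["id"] in task_ids]
kept_filter = list(filter(lambda e: e.env_args.task["id"] in task_ids, exp_args_list))
print(len(kept_filter))
```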

Comment on lines 20 to 36
n_jobs = 1
os.environ["AGENTLAB_DEBUG"] = "1"
study = make_study(
benchmark=OsworldBenchmark(test_set_name="test_small.json"), # type: ignore
agent_args=[OSWORLD_CLAUDE],
comment="osworld debug 2",
logging_level=logging.INFO,
logging_level_stdout=logging.INFO,
)

if os.environ.get("AGENTLAB_DEBUG"):
task_ids = get_task_ids()
study.exp_args_list = [exp_args for exp_args in study.exp_args_list if exp_args.env_args.task["id"] in task_ids] # type: ignore
print(f"Debug on {len(study.exp_args_list)} experiments")
study.run(n_jobs=4, n_relaunch=1, parallel_backend="ray")
else:
study.run(n_jobs=n_jobs, n_relaunch=1, parallel_backend="ray")

This comment was marked as resolved.

Comment on lines 445 to 447
def osworld_obs_preprocessor(self, obs):
"""Preprocess the observation for OSWorld benchmark."""
return obs

This comment was marked as resolved.

Comment on lines 381 to 385
def set_benchmark(self, benchmark: AgentLabBenchmark | BgymBenchmark, demo_mode: bool):
"""Set benchmark specific flags."""
benchmark_name = benchmark.name
if benchmark_name == "osworld":
self.config.obs.use_osworld_obs_preprocessor = True

Incomplete Benchmark Configuration (category: Functionality)

What is the issue?

The benchmark setup only sets a flag for OSWorld but doesn't configure other essential benchmark-specific parameters that might be needed for proper functionality.

Why this matters

Incomplete benchmark configuration could lead to the agent using incorrect action sets or observation processing methods for different benchmark environments.

Suggested change

Enhance benchmark configuration:

def set_benchmark(self, benchmark: AgentLabBenchmark | BgymBenchmark, demo_mode: bool):
    """Set benchmark specific flags and configurations."""
    benchmark_name = benchmark.name
    if benchmark_name == "osworld":
        self.config.obs.use_osworld_obs_preprocessor = True
        self.config.summarizer.do_summary = False  # OSWorld typically doesn't need summarization
        self.action_set = OSWorldActionSet("computer_13")
        self.config.action_subsets = ("coord",)
    elif benchmark_name == "browsergym":
        self.config.obs.use_dom = True
        self.config.obs.use_axtree = True
        self.action_set = bgym.HighLevelActionSet(("bid", "coord"))

Comment on lines 568 to 581
OSWORLD_CLAUDE = ToolUseAgentArgs(
model_args=CLAUDE_MODEL_CONFIG,
config=PromptConfig(
tag_screenshot=True,
goal=Goal(goal_as_system_msg=True),
obs=Obs(
use_last_error=True,
use_screenshot=True,
use_axtree=True,
use_dom=False,
use_som=False,
use_tabs=False,
),
summarizer=Summarizer(do_summary=True), # do not summarize in OSWorld

This comment was marked as resolved.

return marks, drew_nodes, tagged_screenshot, element_list


def trim_accessibility_tree(linearized_accessibility_tree, max_tokens):

Clarify Token Limit Parameter (category: Documentation)

What is the issue?

The max_tokens parameter purpose and unit (e.g., GPT tokens vs characters) is unclear.

Why this matters

Without understanding the unit of max_tokens, developers might pass incorrect values leading to unexpected truncation.

Suggested change

def trim_accessibility_tree(linearized_accessibility_tree, max_tokens: int) -> str:
"""Truncate accessibility tree to fit within GPT token limit.

Args:
    linearized_accessibility_tree: The tree to truncate
    max_tokens: Maximum number of GPT-4 tokens to allow
"""

if not xlm_file_str:
return []

root = ET.fromstring(xlm_file_str)

Unsafe XML Parsing (category: Security)

What is the issue?

XML parsing without protection against XXE (XML External Entity) attacks

Why this matters

Allows malicious XML input to potentially extract sensitive files, execute remote requests, or cause denial of service through entity expansion attacks

Suggested change
# Add entity protection
parser = ET.XMLParser()
parser.entity_declaration = lambda *args, **kwargs: None
root = ET.fromstring(xlm_file_str, parser=parser)
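An alternative stdlib-only sketch (not the PR's code): reject DTDs outright before parsing, which blocks entity-declaration tricks at the cost of refusing any document with a DOCTYPE; the defusedxml package is the commonly recommended library-grade hardened parser.

```python
import xml.etree.ElementTree as ET


def safe_fromstring(xml_str: str) -> ET.Element:
    """Sketch: refuse DTDs/entity declarations before parsing, a blunt
    stdlib-only defense against entity-expansion attacks."""
    lowered = xml_str.lower()
    if "<!doctype" in lowered or "<!entity" in lowered:
        raise ValueError("DTDs/entities are not allowed in accessibility XML")
    return ET.fromstring(xml_str)


print(safe_fromstring("<root><node role='button'/></root>").tag)
```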
