
OSWorld benchmark #255


Open · wants to merge 56 commits into base: main

Conversation


@ollmer ollmer commented Jun 24, 2025

TODO:

  • fill up the setup.md
  • write basic tests like the ones we have for the Gaia bench in test_gaia_agent.py
  • settle the remaining TODOs in the code

Description by Korbit AI

What change is being made?

Add the OSWorld benchmark integration, including support for task management and execution environment setup, to allow AgentLab to process and evaluate tasks within the OSWorld framework.

Why are these changes being made?

These changes introduce the OSWorld benchmark integration, enabling testing and benchmarking of AgentLab's capabilities within an operating system environment. This allows more comprehensive task evaluation, leveraging OSWorld's rich set of desktop task scenarios to exercise AgentLab's automation and interaction abilities. It also establishes a standardized method for setting up and running these evaluations on virtual machines, bringing consistency and robustness to the testing process.

Is this description stale? Ask me to generate a new description by commenting /korbit-generate-pr-description


korbit-ai bot commented Jun 24, 2025

Based on your review schedule, I'll hold off on reviewing this PR until it's marked as ready for review. If you'd like me to take a look now, comment /korbit-review.

Your admin can change your review schedule in the Korbit Console

@git clone https://github.com/xlang-ai/OSWorld || true
@echo "Modifying OSWorld requirements.txt to remove pinned versions..."
@cd OSWorld && \
sed -i.bak 's/numpy~=.*/numpy/' requirements.txt && \
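The Makefile recipe above strips the version pin from numpy so OSWorld's requirements can coexist with AgentLab's. A minimal Python sketch of the same unpinning step (the file contents here are illustrative, not OSWorld's actual requirements):

```python
import re

# Sketch: strip the "~=" pin from numpy in a requirements file, mirroring
# the Makefile's sed step. The contents below are illustrative only.
reqs = "numpy~=1.24.0\nrequests~=2.31.0\n"
unpinned = re.sub(r"^numpy~=.*$", "numpy", reqs, flags=re.M)
print(unpinned)
```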
Collaborator Author

neat trick!

@ollmer ollmer changed the title [WIP] OSWorld benchmark OSWorld benchmark Jul 10, 2025
@ollmer ollmer requested a review from recursix July 10, 2025 17:10
@ollmer
Collaborator Author

ollmer commented Jul 10, 2025

The code is good enough to run evaluations now. We intend to make it a little cleaner and add a README describing how to set up your VMs to run the eval, but otherwise it's ready to review.

@ollmer ollmer marked this pull request as ready for review July 10, 2025 17:13
"source.organizeImports": "explicit",
"source.fixAll": "never"
}
"source.organizeImports": "always",
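For context, the hunk above changes the workspace's on-save code actions from "always" to "explicit". A complete settings fragment with the new values might look like this (the surrounding `editor.codeActionsOnSave` key is assumed from VS Code's settings schema):

```json
{
  "editor.codeActionsOnSave": {
    "source.organizeImports": "explicit",
    "source.fixAll": "never"
  }
}
```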
Collaborator Author

This was my opinionated update; let me know if you're not comfortable with it and we can roll it back.

@@ -1333,7 +1333,7 @@ def plot_profiling(ax, step_info_list: list[StepInfo], summary_info: dict, progr
horizontalalignment="right",
rotation=0,
clip_on=True,
antialiased=True,
# antialiased=True,
Collaborator Author

what's this?

Collaborator

matplotlib==3.7.5 does not support this.

@@ -405,7 +414,8 @@ def __init__(

def obs_preprocessor(self, obs):
obs = copy(obs)

if self.config.obs.use_osworld_obs_preprocessor:
return self.osworld_obs_preprocessor(obs)
Collaborator Author

Kinda hacky solution. I think it's better to do all the preprocessing right inside the OSWorld gym's .step() method and just introduce a universal flag, skip_preprocessing, here in the agent, which set_benchmark() would set to True when the benchmark is osworld.
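A minimal sketch of this proposed alternative (the names `skip_preprocessing`, `ObsConfig`, and the config shape are hypothetical illustrations, not the PR's actual API):

```python
from dataclasses import dataclass, field


@dataclass
class ObsConfig:
    skip_preprocessing: bool = False  # hypothetical universal flag


@dataclass
class Config:
    obs: ObsConfig = field(default_factory=ObsConfig)


class Agent:
    def __init__(self):
        self.config = Config()

    def set_benchmark(self, benchmark_name: str):
        # Benchmarks that preprocess inside their own gym .step()
        # tell the agent to leave observations untouched.
        if benchmark_name == "osworld":
            self.config.obs.skip_preprocessing = True

    def obs_preprocessor(self, obs):
        if self.config.obs.skip_preprocessing:
            return obs  # the gym already did the work in .step()
        return {**obs, "preprocessed": True}


agent = Agent()
agent.set_benchmark("osworld")
print(agent.obs_preprocessor({"screenshot": b""}))
```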

Collaborator

Good idea!

action_set=OSWorldActionSet("computer_13"), # or "pyautogui"
)

OSWORLD_OAI = ToolUseAgentArgs(
Collaborator Author

Let's run an eval with this config at least once; I remember having issues trying to do that.

Collaborator

Done! Performs poorly though.


@korbit-ai korbit-ai bot left a comment


Review by Korbit AI

Korbit automatically attempts to detect when you fix issues in new commits.
  • Error Handling: Missing Error Context in Timing Decorator
  • Functionality: Incomplete Benchmark Configuration
  • Documentation: Clarify Token Limit Parameter
  • Performance: Inefficient List Filtering with Set Membership
  • Logging: Print Statement Instead of Logger ✅ Fix detected
  • Security: Unsafe File Path Handling ✅ Fix detected
  • Design: Inconsistent Configuration Management ✅ Fix detected
  • Readability: Contradictory Comment vs Code Configuration ✅ Fix detected
  • Functionality: Empty OSWorld Preprocessor ✅ Fix detected
  • Security: Unsafe XML Parsing
Files scanned (all reviewed):
experiments/run_osworld.py
src/agentlab/benchmarks/abstract_env.py
src/agentlab/benchmarks/osworld_axtree_preprocessing.py
src/agentlab/agents/tool_use_agent/tool_use_agent.py
src/agentlab/llm/response_api.py
src/agentlab/analyze/inspect_results.py
src/agentlab/experiments/study.py
src/agentlab/analyze/agent_xray.py
src/agentlab/benchmarks/osworld.py


@wraps(step_func)
def wrapped_step(self, action: str):
action_exec_start = time.time()
obs, reward, terminated, truncated, env_info = step_func(self, action)

Missing Error Context in Timing Decorator (category: Error Handling)

What is the issue?

The step_func call in the timing decorator lacks error handling, potentially losing timing context if an exception occurs.

Why this matters

If step_func raises an exception, the timing information and context would be lost, making it harder to debug performance issues or failures.

Suggested change
    @wraps(step_func)
    def wrapped_step(self, action: str):
        action_exec_start = time.time()
        try:
            obs, reward, terminated, truncated, env_info = step_func(self, action)
            action_exec_stop = time.time()

            # Ensure env_info is a dictionary
            if env_info is None:
                env_info = {}

            if "action_exec_start" not in env_info:
                env_info["action_exec_start"] = action_exec_start
            if "action_exec_stop" not in env_info:
                env_info["action_exec_stop"] = action_exec_stop
            if "action_exec_timeout" not in env_info:
                env_info["action_exec_timeout"] = 0.0

            return obs, reward, terminated, truncated, env_info
        except Exception as e:
            action_exec_stop = time.time()
            # Re-raise with timing context
            raise type(e)(f"Error during step (duration: {action_exec_stop - action_exec_start:.3f}s): {str(e)}") from e
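To make the timing-decorator pattern concrete, here is a self-contained usage sketch with a stand-in environment (the `DummyEnv` class and the simplified no-exception path are illustrative, not the PR's actual code):

```python
import time
from functools import wraps


def add_step_timing(step_func):
    """Wrap an env step to record wall-clock timing in env_info."""
    @wraps(step_func)
    def wrapped_step(self, action: str):
        action_exec_start = time.time()
        obs, reward, terminated, truncated, env_info = step_func(self, action)
        env_info = env_info or {}  # tolerate steps that return None info
        env_info.setdefault("action_exec_start", action_exec_start)
        env_info.setdefault("action_exec_stop", time.time())
        env_info.setdefault("action_exec_timeout", 0.0)
        return obs, reward, terminated, truncated, env_info
    return wrapped_step


class DummyEnv:
    @add_step_timing
    def step(self, action: str):
        return {"obs": action}, 0.0, False, False, None


_, _, _, _, info = DummyEnv().step("noop")
print(sorted(info))
```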



def get_task_ids() -> set[str]:
with open("experiments/osworld_debug_task_ids.json", "r") as f:

This comment was marked as resolved.

if os.environ.get("AGENTLAB_DEBUG"):
task_ids = get_task_ids()
study.exp_args_list = [exp_args for exp_args in study.exp_args_list if exp_args.env_args.task["id"] in task_ids] # type: ignore
print(f"Debug on {len(study.exp_args_list)} experiments")

This comment was marked as resolved.

Comment on lines +31 to +32
task_ids = get_task_ids()
study.exp_args_list = [exp_args for exp_args in study.exp_args_list if exp_args.env_args.task["id"] in task_ids] # type: ignore

Inefficient List Filtering with Set Membership (category: Performance)

What is the issue?

The task IDs are loaded from file and filtered using a list comprehension, which creates an unnecessary intermediate list and performs a membership test against a set for each item.

Why this matters

For large experiment lists, this creates memory overhead from the intermediate list and has O(n) complexity for each membership test against the task_ids set.

Suggested change

Use filter() with a lambda or generator expression to avoid creating intermediate lists:

study.exp_args_list = list(filter(lambda x: x.env_args.task["id"] in task_ids, study.exp_args_list))
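As a sanity check, the comprehension and the `filter()` form select the same experiments; a sketch with stand-in dataclasses (`ExpArgs`/`EnvArgs` here are simplified stand-ins for the real agentlab classes):

```python
from dataclasses import dataclass


@dataclass
class EnvArgs:
    task: dict


@dataclass
class ExpArgs:
    env_args: EnvArgs


exp_args_list = [ExpArgs(EnvArgs({"id": i})) for i in ("a", "b", "c")]
task_ids = {"a", "c"}

# Both styles keep exactly the experiments whose task id is in the set.
kept_comprehension = [e for e in exp_args_list if e.env_args.task["id"] in task_ids]
kept_filter = list(filter(lambda e: e.env_args.task["id"] in task_ids, exp_args_list))
print(len(kept_filter))
```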

Comment on lines 20 to 36
n_jobs = 1
os.environ["AGENTLAB_DEBUG"] = "1"
study = make_study(
benchmark=OsworldBenchmark(test_set_name="test_small.json"), # type: ignore
agent_args=[OSWORLD_CLAUDE],
comment="osworld debug 2",
logging_level=logging.INFO,
logging_level_stdout=logging.INFO,
)

if os.environ.get("AGENTLAB_DEBUG"):
task_ids = get_task_ids()
study.exp_args_list = [exp_args for exp_args in study.exp_args_list if exp_args.env_args.task["id"] in task_ids] # type: ignore
print(f"Debug on {len(study.exp_args_list)} experiments")
study.run(n_jobs=4, n_relaunch=1, parallel_backend="ray")
else:
study.run(n_jobs=n_jobs, n_relaunch=1, parallel_backend="ray")

This comment was marked as resolved.

Comment on lines 445 to 447
def osworld_obs_preprocessor(self, obs):
"""Preprocess the observation for OSWorld benchmark."""
return obs

This comment was marked as resolved.

Comment on lines 381 to 385
def set_benchmark(self, benchmark: AgentLabBenchmark | BgymBenchmark, demo_mode: bool):
"""Set benchmark specific flags."""
benchmark_name = benchmark.name
if benchmark_name == "osworld":
self.config.obs.use_osworld_obs_preprocessor = True

Incomplete Benchmark Configuration (category: Functionality)

What is the issue?

The benchmark setup only sets a flag for OSWorld but doesn't configure other essential benchmark-specific parameters that might be needed for proper functionality.

Why this matters

Incomplete benchmark configuration could lead to the agent using incorrect action sets or observation processing methods for different benchmark environments.

Suggested change

Enhance benchmark configuration:

def set_benchmark(self, benchmark: AgentLabBenchmark | BgymBenchmark, demo_mode: bool):
    """Set benchmark specific flags and configurations."""
    benchmark_name = benchmark.name
    if benchmark_name == "osworld":
        self.config.obs.use_osworld_obs_preprocessor = True
        self.config.summarizer.do_summary = False  # OSWorld typically doesn't need summarization
        self.action_set = OSWorldActionSet("computer_13")
        self.config.action_subsets = ("coord",)
    elif benchmark_name == "browsergym":
        self.config.obs.use_dom = True
        self.config.obs.use_axtree = True
        self.action_set = bgym.HighLevelActionSet(("bid", "coord"))

Comment on lines 568 to 581
OSWORLD_CLAUDE = ToolUseAgentArgs(
model_args=CLAUDE_MODEL_CONFIG,
config=PromptConfig(
tag_screenshot=True,
goal=Goal(goal_as_system_msg=True),
obs=Obs(
use_last_error=True,
use_screenshot=True,
use_axtree=True,
use_dom=False,
use_som=False,
use_tabs=False,
),
summarizer=Summarizer(do_summary=True), # do not summarize in OSWorld

This comment was marked as resolved.

return marks, drew_nodes, tagged_screenshot, element_list


def trim_accessibility_tree(linearized_accessibility_tree, max_tokens):

Clarify Token Limit Parameter (category: Documentation)

What is the issue?

The max_tokens parameter purpose and unit (e.g., GPT tokens vs characters) is unclear.

Why this matters

Without understanding the unit of max_tokens, developers might pass incorrect values leading to unexpected truncation.

Suggested change

def trim_accessibility_tree(linearized_accessibility_tree, max_tokens: int) -> str:
"""Truncate accessibility tree to fit within GPT token limit.

Args:
    linearized_accessibility_tree: The tree to truncate
    max_tokens: Maximum number of GPT-4 tokens to allow
"""

if not xlm_file_str:
return []

root = ET.fromstring(xlm_file_str)

Unsafe XML Parsing (category: Security)

What is the issue?

XML parsing without protection against XXE (XML External Entity) attacks

Why this matters

Allows malicious XML input to potentially extract sensitive files, execute remote requests, or cause denial of service through entity expansion attacks

Suggested change
# Add entity protection
parser = ET.XMLParser()
parser.entity_declaration = lambda *args, **kwargs: None
root = ET.fromstring(xlm_file_str, parser=parser)
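An alternative stdlib-only sketch (not the PR's code): reject DTDs outright before parsing, which blocks entity-declaration tricks at the cost of refusing any document with a DOCTYPE; the defusedxml package is the commonly recommended library-grade hardened parser.

```python
import xml.etree.ElementTree as ET


def safe_fromstring(xml_str: str) -> ET.Element:
    """Sketch: refuse DTDs/entity declarations before parsing, a blunt
    stdlib-only defense against entity-expansion attacks."""
    lowered = xml_str.lower()
    if "<!doctype" in lowered or "<!entity" in lowered:
        raise ValueError("DTDs/entities are not allowed in accessibility XML")
    return ET.fromstring(xml_str)


print(safe_fromstring("<root><node role='button'/></root>").tag)
```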
