OSWorld benchmark #255
Conversation
Based on your review schedule, I'll hold off on reviewing this PR until it's marked as ready for review. If you'd like me to take a look now, comment
@git clone https://github.com/xlang-ai/OSWorld || true
@echo "Modifying OSWorld requirements.txt to remove pinned versions..."
@cd OSWorld && \
    sed -i.bak 's/numpy~=.*/numpy/' requirements.txt && \
neat trick!
Code is good enough to run evaluations now. We intend to make it a little cleaner and add a README describing how to set up your VMs to run the eval, but otherwise it's ready to review.
"source.organizeImports": "explicit", | ||
"source.fixAll": "never" | ||
} | ||
"source.organizeImports": "always", |
This was an opinionated update on my part; let me know if you're not comfortable with it and we can roll it back.
@@ -1333,7 +1333,7 @@ def plot_profiling(ax, step_info_list: list[StepInfo], summary_info: dict, progr
    horizontalalignment="right",
    rotation=0,
    clip_on=True,
    antialiased=True,
    # antialiased=True,
what's this?
matplotlib==3.7.5 does not support this.
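If keeping antialiasing on newer matplotlib is worth it, a guarded call is one option instead of commenting the kwarg out. A minimal sketch, assuming the pinned 3.7.5 simply rejects the kwarg (this is not the PR's code):

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
kwargs = dict(horizontalalignment="right", rotation=0, clip_on=True)
try:
    # Newer matplotlib accepts antialiased on text artists.
    ax.text(0.5, 0.5, "label", antialiased=True, **kwargs)
except (AttributeError, TypeError):
    # Older builds (e.g., the pinned matplotlib==3.7.5) reject the kwarg.
    ax.text(0.5, 0.5, "label", **kwargs)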
@@ -405,7 +414,8 @@ def __init__(

def obs_preprocessor(self, obs):
    obs = copy(obs)

    if self.config.obs.use_osworld_obs_preprocessor:
        return self.osworld_obs_preprocessor(obs)
Kinda hacky solution. I think it's better to do all the preprocessing right inside the osworld gym's .step() method and just introduce a universal skip_preprocessing flag here in the agent, set to True in set_benchmark() when the benchmark is osworld.
Good idea!
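A minimal sketch of that suggestion, for discussion — the flag name skip_preprocessing and the class shape are assumptions, not the final implementation:

from copy import copy

class ToolUseAgent:
    def set_benchmark(self, benchmark, demo_mode: bool):
        """Set benchmark-specific flags."""
        # OSWorld's gym would do all preprocessing inside .step(),
        # so the agent passes observations through untouched.
        self.skip_preprocessing = benchmark.name == "osworld"  # assumed flag name

    def obs_preprocessor(self, obs):
        if self.skip_preprocessing:
            return obs
        obs = copy(obs)
        # ... regular preprocessing for browser benchmarks ...
        return obs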
    action_set=OSWorldActionSet("computer_13"),  # or "pyautogui"
)

OSWORLD_OAI = ToolUseAgentArgs(
Let's run an eval with this config at least once; I remember having issues when I tried.
Done! Performs poorly though.
Review by Korbit AI
Korbit automatically attempts to detect when you fix issues in new commits.
Issue | Status
---|---
Missing Error Context in Timing Decorator |
Incomplete Benchmark Configuration |
Clarify Token Limit Parameter |
Inefficient List Filtering with Set Membership |
Print Statement Instead of Logger | ✅ Fix detected
Unsafe File Path Handling | ✅ Fix detected
Inconsistent Configuration Management | ✅ Fix detected
Contradictory Comment vs Code Configuration | ✅ Fix detected
Empty OSWorld Preprocessor | ✅ Fix detected
Unsafe XML Parsing |
Files scanned
File Path | Reviewed |
---|---|
experiments/run_osworld.py | ✅ |
src/agentlab/benchmarks/abstract_env.py | ✅ |
src/agentlab/benchmarks/osworld_axtree_preprocessing.py | ✅ |
src/agentlab/agents/tool_use_agent/tool_use_agent.py | ✅ |
src/agentlab/llm/response_api.py | ✅ |
src/agentlab/analyze/inspect_results.py | ✅ |
src/agentlab/experiments/study.py | ✅ |
src/agentlab/analyze/agent_xray.py | ✅ |
src/agentlab/benchmarks/osworld.py | ✅ |
@wraps(step_func)
def wrapped_step(self, action: str):
    action_exec_start = time.time()
    obs, reward, terminated, truncated, env_info = step_func(self, action)
Missing Error Context in Timing Decorator 
What is the issue?
The step_func call in the timing decorator lacks error handling, potentially losing timing context if an exception occurs.
Why this matters
If step_func raises an exception, the timing information and context would be lost, making it harder to debug performance issues or failures.
Suggested change
@wraps(step_func)
def wrapped_step(self, action: str):
    action_exec_start = time.time()
    try:
        obs, reward, terminated, truncated, env_info = step_func(self, action)
        action_exec_stop = time.time()
        # Ensure env_info is a dictionary
        if env_info is None:
            env_info = {}
        if "action_exec_start" not in env_info:
            env_info["action_exec_start"] = action_exec_start
        if "action_exec_stop" not in env_info:
            env_info["action_exec_stop"] = action_exec_stop
        if "action_exec_timeout" not in env_info:
            env_info["action_exec_timeout"] = 0.0
        return obs, reward, terminated, truncated, env_info
    except Exception as e:
        action_exec_stop = time.time()
        # Re-raise with timing context
        raise type(e)(f"Error during step (duration: {action_exec_stop - action_exec_start:.3f}s): {str(e)}") from e
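For context, the wrapped_step shown in the diff presumably lives inside an enclosing decorator; a self-contained sketch of that shape (the outer name timed_step is an assumption):

import time
from functools import wraps

def timed_step(step_func):
    """Record wall-clock timing of an env step in env_info (sketch)."""
    @wraps(step_func)
    def wrapped_step(self, action: str):
        action_exec_start = time.time()
        obs, reward, terminated, truncated, env_info = step_func(self, action)
        env_info = env_info or {}
        env_info.setdefault("action_exec_start", action_exec_start)
        env_info.setdefault("action_exec_stop", time.time())
        return obs, reward, terminated, truncated, env_info
    return wrapped_step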
def get_task_ids() -> set[str]:
    with open("experiments/osworld_debug_task_ids.json", "r") as f:
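The body is cut off in the diff; a plausible completion, assuming the JSON file holds a flat list of task ids:

import json

def get_task_ids() -> set[str]:
    with open("experiments/osworld_debug_task_ids.json", "r") as f:
        return set(json.load(f))  # assumed file format: ["id1", "id2", ...]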
if os.environ.get("AGENTLAB_DEBUG"):
    task_ids = get_task_ids()
    study.exp_args_list = [exp_args for exp_args in study.exp_args_list if exp_args.env_args.task["id"] in task_ids]  # type: ignore
    print(f"Debug on {len(study.exp_args_list)} experiments")
task_ids = get_task_ids()
study.exp_args_list = [exp_args for exp_args in study.exp_args_list if exp_args.env_args.task["id"] in task_ids]  # type: ignore
Inefficient List Filtering with Set Membership 
What is the issue?
The task IDs are loaded from file and filtered using a list comprehension, which creates an unnecessary intermediate list and performs a membership test against a set for each item.
Why this matters
For large experiment lists, this creates memory overhead from the intermediate list; the set membership test itself is cheap (average O(1)), so lookup cost is not the real concern.
Suggested change
Use filter() with a lambda (or a generator expression where a materialized list is not required downstream):
study.exp_args_list = list(filter(lambda x: x.env_args.task["id"] in task_ids, study.exp_args_list))
experiments/run_osworld.py (Outdated)
n_jobs = 1
os.environ["AGENTLAB_DEBUG"] = "1"
study = make_study(
    benchmark=OsworldBenchmark(test_set_name="test_small.json"),  # type: ignore
    agent_args=[OSWORLD_CLAUDE],
    comment="osworld debug 2",
    logging_level=logging.INFO,
    logging_level_stdout=logging.INFO,
)

if os.environ.get("AGENTLAB_DEBUG"):
    task_ids = get_task_ids()
    study.exp_args_list = [exp_args for exp_args in study.exp_args_list if exp_args.env_args.task["id"] in task_ids]  # type: ignore
    print(f"Debug on {len(study.exp_args_list)} experiments")
    study.run(n_jobs=4, n_relaunch=1, parallel_backend="ray")
else:
    study.run(n_jobs=n_jobs, n_relaunch=1, parallel_backend="ray")
def osworld_obs_preprocessor(self, obs):
    """Preprocess the observation for OSWorld benchmark."""
    return obs
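For illustration, one direction this currently empty preprocessor could take, reusing the trimming helper from osworld_axtree_preprocessing.py — the import path, observation key, and token budget are assumptions:

from agentlab.benchmarks.osworld_axtree_preprocessing import (  # assumed import path
    trim_accessibility_tree,
)

def osworld_obs_preprocessor(self, obs):
    """Preprocess the observation for OSWorld benchmark (sketch)."""
    obs = dict(obs)
    tree = obs.get("accessibility_tree")
    if tree:
        # Assume the env already provides a linearized tree string here;
        # otherwise a linearization step would come first.
        obs["axtree_txt"] = trim_accessibility_tree(tree, max_tokens=10_000)
    return obs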
def set_benchmark(self, benchmark: AgentLabBenchmark | BgymBenchmark, demo_mode: bool):
    """Set benchmark specific flags."""
    benchmark_name = benchmark.name
    if benchmark_name == "osworld":
        self.config.obs.use_osworld_obs_preprocessor = True
Incomplete Benchmark Configuration 
What is the issue?
The benchmark setup only sets a flag for OSWorld but doesn't configure other essential benchmark-specific parameters that might be needed for proper functionality.
Why this matters
Incomplete benchmark configuration could lead to the agent using incorrect action sets or observation processing methods for different benchmark environments.
Suggested change
Enhance benchmark configuration:
def set_benchmark(self, benchmark: AgentLabBenchmark | BgymBenchmark, demo_mode: bool):
    """Set benchmark specific flags and configurations."""
    benchmark_name = benchmark.name
    if benchmark_name == "osworld":
        self.config.obs.use_osworld_obs_preprocessor = True
        self.config.summarizer.do_summary = False  # OSWorld typically doesn't need summarization
        self.action_set = OSWorldActionSet("computer_13")
        self.config.action_subsets = ("coord",)
    elif benchmark_name == "browsergym":
        self.config.obs.use_dom = True
        self.config.obs.use_axtree = True
        self.action_set = bgym.HighLevelActionSet(("bid", "coord"))
OSWORLD_CLAUDE = ToolUseAgentArgs(
    model_args=CLAUDE_MODEL_CONFIG,
    config=PromptConfig(
        tag_screenshot=True,
        goal=Goal(goal_as_system_msg=True),
        obs=Obs(
            use_last_error=True,
            use_screenshot=True,
            use_axtree=True,
            use_dom=False,
            use_som=False,
            use_tabs=False,
        ),
        summarizer=Summarizer(do_summary=True),  # do not summarize in OSWorld
    return marks, drew_nodes, tagged_screenshot, element_list


def trim_accessibility_tree(linearized_accessibility_tree, max_tokens):
Clarify Token Limit Parameter 
What is the issue?
The max_tokens parameter purpose and unit (e.g., GPT tokens vs characters) is unclear.
Why this matters
Without understanding the unit of max_tokens, developers might pass incorrect values leading to unexpected truncation.
Suggested change
def trim_accessibility_tree(linearized_accessibility_tree, max_tokens: int) -> str:
    """Truncate accessibility tree to fit within GPT token limit.

    Args:
        linearized_accessibility_tree: The tree to truncate
        max_tokens: Maximum number of GPT-4 tokens to allow
    """
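On the implementation side, a minimal sketch of token-based trimming with tiktoken — assuming a GPT-4-style cl100k_base encoding; the PR's actual implementation may differ:

import tiktoken

def trim_accessibility_tree(linearized_accessibility_tree: str, max_tokens: int) -> str:
    """Truncate the linearized tree to at most max_tokens tokens (sketch)."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(linearized_accessibility_tree)
    if len(tokens) <= max_tokens:
        return linearized_accessibility_tree
    # Decode the truncated prefix and mark the cut so the model knows.
    return enc.decode(tokens[:max_tokens]) + "[...]\n"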
if not xlm_file_str:
    return []

root = ET.fromstring(xlm_file_str)
Unsafe XML Parsing 
What is the issue?
XML parsing without protection against XXE (XML External Entity) attacks
Why this matters
Allows malicious XML input to potentially extract sensitive files, execute remote requests, or cause denial of service through entity expansion attacks
Suggested change
# Add entity protection
parser = ET.XMLParser()
parser.entity_declaration = lambda *args, **kwargs: None
root = ET.fromstring(xlm_file_str, parser=parser)
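Alternatively, defusedxml provides hardened drop-in replacements and avoids hand-rolled parser hooks — a sketch, assuming adding the dependency is acceptable:

import defusedxml.ElementTree as DET

# Raises on entity declarations instead of expanding them (XXE-safe).
root = DET.fromstring(xlm_file_str)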
…OSWorldActionSet a dataclass for proper repr.
…IONS_OAI_CHATCOMPLETION_TOOLS and add format_chat_completion_tools_to_response_api function for tool format conversion
TODO:
Description by Korbit AI
What change is being made?
Add the OSWorld benchmark integration, including support for task management and execution environment setup, to allow AgentLab to process and evaluate tasks within the OSWorld framework.
Why are these changes being made?
These changes support integration of the OSWorld benchmark, enabling testing and benchmarking of AgentLab's capabilities in an operating-system environment. This allows more comprehensive task evaluation, leveraging OSWorld's rich set of desktop task scenarios to exercise AgentLab on automation and interaction tasks. It also establishes a standardized method for setting up and running these evaluations on virtual machines, bringing consistency and robustness to the testing process.