
feat: add missing logic rai_bench #595


Merged
jmatejcz merged 20 commits into development from jm/feat/easier-bench-running on May 21, 2025

Conversation

@jmatejcz (Contributor) commented on May 19, 2025

Purpose

Add a couple of missing elements:

  • Action models and mocks, so that tasks in the tool calling agent benchmark can use actions
  • Timeouts for benchmarks (a single scenario or task)
  • Langfuse integration for the manipulation benchmark
  • A fix for the Validation tab in the visualise script, which was not working properly

Also improve the user experience of the rai_bench package, which feels a bit awkward from the user's perspective (I concluded this while writing the docs).

Proposed Changes

  • Fixed the existing mocks of camera topics

  • Added models and mocks of interfaces for messages used in navigation
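A mock for a navigation interface message can be sketched with plain dataclasses; the class and field names below are illustrative, not the actual rai_bench definitions:

```python
# Hypothetical mock of a navigation goal message; field names are
# assumptions for illustration, not the real rai_bench interfaces.
from dataclasses import dataclass, field


@dataclass
class MockPose:
    x: float = 0.0
    y: float = 0.0
    yaw: float = 0.0


@dataclass
class MockNavigateToPoseGoal:
    pose: MockPose = field(default_factory=MockPose)
    behavior_tree: str = ""


goal = MockNavigateToPoseGoal(pose=MockPose(x=1.0, y=2.0))
```

A mock like this lets the benchmark validate an agent's tool calls without a running ROS 2 stack.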

  • Added a timeout for a single scenario in the Manipulation bench (210 s) and in the Tool Calling Agent bench (60 s)
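A per-scenario timeout can be sketched with the standard library, assuming each scenario runs in a worker thread; the real rai_bench implementation may differ:

```python
# Minimal per-scenario timeout sketch using a worker thread.
# Note: on timeout the worker thread is abandoned, not killed;
# a production version would also need scenario cleanup.
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

SCENARIO_TIMEOUT_S = 210  # Manipulation bench; 60 for Tool Calling Agent


def run_with_timeout(scenario_fn, timeout_s=SCENARIO_TIMEOUT_S):
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(scenario_fn)
        try:
            return future.result(timeout=timeout_s)
        except FutureTimeout:
            # Record the scenario as failed instead of hanging the run.
            return {"status": "timeout", "score": 0.0}


result = run_with_timeout(lambda: {"status": "ok", "score": 1.0}, timeout_s=5)
```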

  • Added Langfuse for the manipulation bench. Also moved the score-tracing code to the results_processing/ dir

  • Moved code related to predefined benchmarks from examples/ to <benchmark>/predefined, as the tasks and scenarios we define should be part of the package that users can import and use, rather than just examples. The examples dir now contains only code that imports our package and uses it.

  • Introduced a test_models function and benchmark configs that encapsulate all the logic in a single function that can gather results from different models across different benchmarks. See rai_bench/test_models.py for the code and rai_bench/examples/benchmarking_models.py for how to use it.

  • Results are now stored as one run dir per benchmark. This was required because a user can now run the same benchmark with several different sets of parameters, and we have to differentiate those runs.
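One possible layout for "one run dir per benchmark" is a timestamped directory per run, so repeated runs of the same benchmark with different parameters never clobber each other; the path scheme here is an assumption, not the exact rai_bench layout:

```python
# Sketch of a timestamped run directory: base/run_<stamp>/<benchmark>/
from datetime import datetime
from pathlib import Path
import tempfile


def make_run_dir(base: Path, benchmark: str) -> Path:
    stamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    run_dir = base / f"run_{stamp}" / benchmark
    run_dir.mkdir(parents=True, exist_ok=True)
    return run_dir


base = Path(tempfile.mkdtemp())
run_dir = make_run_dir(base, "manipulation_o3de")
```

The timestamp in the directory name also gives the visualise script a natural sort key.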

  • The visualise script now lets the user choose several runs and concatenate their results. Runs are also sorted by date now.
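Concatenating results from several date-sorted runs can be sketched as below; the results.csv file name and its columns are illustrative, not the actual rai_bench output format:

```python
# Sketch: sort selected run dirs by name (which embeds the timestamp)
# and concatenate their result rows.
import csv
from pathlib import Path
import tempfile


def load_runs(run_dirs):
    rows = []
    for run_dir in sorted(run_dirs, key=lambda p: p.name):
        with (run_dir / "results.csv").open() as f:
            rows.extend(csv.DictReader(f))
    return rows


# Usage with two fake run dirs:
base = Path(tempfile.mkdtemp())
for name in ["run_2025-05-20", "run_2025-05-19"]:
    d = base / name
    d.mkdir()
    (d / "results.csv").write_text("model,score\nqwen2.5:7b,0.5\n")
rows = load_runs(list(base.iterdir()))
```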

  • Fixed loading of validation info; the validator tab now renders valid info

  • Restructured the visualise code: a separate results_processing/visualise/ dir, with the script split into several files so it is more readable.

  • Added missing arguments to argparse.

Issues

#526
#462

Testing

  1. Test multiple models in different configurations — look through the file, then run:
python src/rai_bench/rai_bench/examples/benchmarking_models.py
  2. Run a single benchmark and experiment with the new flags:
python src/rai_bench/rai_bench/examples/manipulation_o3de.py --model-name qwen2.5:7b --vendor ollama
python src/rai_bench/rai_bench/examples/tool_calling_agent.py --model-name qwen2.5:7b --vendor ollama
  3. See the streamlit script and explore the UI:
streamlit run src/rai_bench/rai_bench/examples/visualise_streamlit.py

@jmatejcz jmatejcz force-pushed the jm/feat/easier-bench-running branch from 90ccf60 to 88fcfe8 Compare May 20, 2025 07:55
@jmatejcz jmatejcz marked this pull request as ready for review May 20, 2025 07:55
@jmatejcz jmatejcz requested a review from maciejmajek May 20, 2025 08:00
@jmatejcz jmatejcz force-pushed the jm/feat/easier-bench-running branch from 88fcfe8 to 90ccf60 Compare May 20, 2025 08:09
@jmatejcz jmatejcz force-pushed the jm/feat/easier-bench-running branch from 90ccf60 to 35d5117 Compare May 20, 2025 08:12
@maciejmajek (Member) left a comment
LGTM

@maciejmajek maciejmajek merged commit a46bc61 into development May 21, 2025
10 of 12 checks passed
@maciejmajek maciejmajek deleted the jm/feat/easier-bench-running branch May 21, 2025 08:33
This was referenced May 26, 2025