
feat: add missing logic rai_bench #595


Merged
jmatejcz merged 20 commits into development from jm/feat/easier-bench-running on May 21, 2025

Conversation

@jmatejcz (Contributor) commented on May 19, 2025

Purpose

Add a couple of missing elements:

  • Action models and mocks, so that tasks in the tool calling agent benchmark can use actions
  • Timeouts for benchmarks (a single scenario or task)
  • Langfuse integration for the manipulation benchmark
  • A fix for the Validation tab in the visualise script, which was not working properly

Also improve the user experience of the rai_bench package, which feels a bit awkward from the user's perspective (I concluded this while writing the docs).

Proposed Changes

  • Fixed the existing mocks of camera topics

  • Added models and mocks of interfaces for messages used in navigation
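A mock for a navigation interface message can be sketched with plain dataclasses; the class and field names below are illustrative, not the actual rai_bench definitions:

```python
# Hypothetical mock of a navigation goal message; field names are
# assumptions for illustration, not the real rai_bench interfaces.
from dataclasses import dataclass, field


@dataclass
class MockPose:
    x: float = 0.0
    y: float = 0.0
    yaw: float = 0.0


@dataclass
class MockNavigateToPoseGoal:
    pose: MockPose = field(default_factory=MockPose)
    behavior_tree: str = ""


goal = MockNavigateToPoseGoal(pose=MockPose(x=1.0, y=2.0))
```

A mock like this lets the benchmark validate an agent's tool calls without a running ROS 2 stack.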

  • Added a timeout for a single scenario in the Manipulation bench (210 s) and in the Tool Calling Agent bench (60 s)
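A per-scenario timeout can be sketched with the standard library, assuming each scenario runs in a worker thread; the real rai_bench implementation may differ:

```python
# Minimal per-scenario timeout sketch using a worker thread.
# Note: on timeout the worker thread is abandoned, not killed;
# a production version would also need scenario cleanup.
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

SCENARIO_TIMEOUT_S = 210  # Manipulation bench; 60 for Tool Calling Agent


def run_with_timeout(scenario_fn, timeout_s=SCENARIO_TIMEOUT_S):
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(scenario_fn)
        try:
            return future.result(timeout=timeout_s)
        except FutureTimeout:
            # Record the scenario as failed instead of hanging the run.
            return {"status": "timeout", "score": 0.0}


result = run_with_timeout(lambda: {"status": "ok", "score": 1.0}, timeout_s=5)
```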

  • Added Langfuse for the manipulation bench. Also moved the score-tracing code to the results_processing/ dir

  • Moved code related to predefined benchmarks from examples/ to <benchmark>/predefined, as the tasks and scenarios we define should be part of the package that users can import and use, rather than just examples. The examples dir now contains only code that imports our package and uses it.

  • Introduced a test_models function and benchmark configs that encapsulate all the logic in a single function that can gather results from different models across different benchmarks. See rai_bench/test_models.py for the code and rai_bench/examples/benchmarking_models.py for how to use it.

  • Results are now stored as one run dir per benchmark. This was required because a user can now run the same benchmark with several different sets of parameters, and we have to differentiate those runs.
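One possible layout for "one run dir per benchmark" is a timestamped directory per run, so repeated runs of the same benchmark with different parameters never clobber each other; the path scheme here is an assumption, not the exact rai_bench layout:

```python
# Sketch of a timestamped run directory: base/run_<stamp>/<benchmark>/
from datetime import datetime
from pathlib import Path
import tempfile


def make_run_dir(base: Path, benchmark: str) -> Path:
    stamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    run_dir = base / f"run_{stamp}" / benchmark
    run_dir.mkdir(parents=True, exist_ok=True)
    return run_dir


base = Path(tempfile.mkdtemp())
run_dir = make_run_dir(base, "manipulation_o3de")
```

The timestamp in the directory name also gives the visualise script a natural sort key.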

  • The visualise script now lets the user choose several runs and concatenate their results. Runs are also sorted by date now.
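Concatenating results from several date-sorted runs can be sketched as below; the results.csv file name and its columns are illustrative, not the actual rai_bench output format:

```python
# Sketch: sort selected run dirs by name (which embeds the timestamp)
# and concatenate their result rows.
import csv
from pathlib import Path
import tempfile


def load_runs(run_dirs):
    rows = []
    for run_dir in sorted(run_dirs, key=lambda p: p.name):
        with (run_dir / "results.csv").open() as f:
            rows.extend(csv.DictReader(f))
    return rows


# Usage with two fake run dirs:
base = Path(tempfile.mkdtemp())
for name in ["run_2025-05-20", "run_2025-05-19"]:
    d = base / name
    d.mkdir()
    (d / "results.csv").write_text("model,score\nqwen2.5:7b,0.5\n")
rows = load_runs(list(base.iterdir()))
```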

  • Fixed loading of validation info; the validator tab now renders valid info

  • Restructured the visualise code: a separate results_processing/visualise/ dir, with the script split into several files so it is more readable.

  • Added missing arguments to argparse.

Issues

#526
#462

Testing

  1. Test multiple models in different configurations — look through the file, then run:
python src/rai_bench/rai_bench/examples/benchmarking_models.py
  2. Run a single benchmark and experiment with the new flags:
python src/rai_bench/rai_bench/examples/manipulation_o3de.py --model-name qwen2.5:7b --vendor ollama
python src/rai_bench/rai_bench/examples/tool_calling_agent.py --model-name qwen2.5:7b --vendor ollama
  3. See the streamlit script and explore the UI:
streamlit run src/rai_bench/rai_bench/examples/visualise_streamlit.py

@jmatejcz jmatejcz force-pushed the jm/feat/easier-bench-running branch from 90ccf60 to 88fcfe8 Compare May 20, 2025 07:55
@jmatejcz jmatejcz marked this pull request as ready for review May 20, 2025 07:55
@jmatejcz jmatejcz requested a review from maciejmajek May 20, 2025 08:00
@jmatejcz jmatejcz force-pushed the jm/feat/easier-bench-running branch from 88fcfe8 to 90ccf60 Compare May 20, 2025 08:09
@jmatejcz jmatejcz force-pushed the jm/feat/easier-bench-running branch from 90ccf60 to 35d5117 Compare May 20, 2025 08:12
@maciejmajek (Member) left a comment
LGTM

@maciejmajek maciejmajek merged commit a46bc61 into development May 21, 2025
10 of 12 checks passed
@maciejmajek maciejmajek deleted the jm/feat/easier-bench-running branch May 21, 2025 08:33
This was referenced May 26, 2025