feat: add missing logic rai_bench #595
Merged
adjusted timeouts in keep alive ollama and waiting for task
add selecting task types selected importables
moved tracing code to result_processing/
adjusted to new result changes, fixed old bugs
maciejmajek
approved these changes
May 21, 2025
LGTM
This was referenced May 26, 2025
Purpose
Add a couple of missing elements, and improve the user experience of the rai_bench package, which feels a bit awkward from the user's perspective (I concluded that while writing the docs).
Proposed Changes
Fixed existing mocks of camera topics
Added models and mocks of interfaces for messages used in navigation
Added a timeout for a single scenario in the Manipulation benchmark (210 s) and in the Tool Calling Agent benchmark (60 s).
Added Langfuse tracing for the Manipulation benchmark. Also moved the score-tracing code to the results_processing/ dir.
Moved code related to predefined benchmarks from examples/ to <benchmark>/predefined, as tasks and scenarios defined by us should be part of the package that users can import and use, rather than just examples. The example dir now contains only code that imports our package and uses it.
Introduced the test_models function and benchmark configs, which encapsulate all logic in a single function that can gather results from different models across different benchmarks. See rai_bench/test_models.py for the code and rai_bench/example/benchmarking_models.py for how to use it.
Results are now stored as 1 run dir = 1 benchmark. This was required because users can now run the same benchmark with several different sets of params, and those runs have to be distinguished.
The visualise script now lets users choose several runs and concatenates their results. Runs are also sorted by date now.
Fixed loading of validation info; the validator tab now renders it correctly.
Restructured the visualise code into a separate dir, results_processing/visualise/, and split the script into several files so it is more readable.
Added missing arguments to argparse.
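The per-scenario timeout can be pictured with a minimal sketch. It assumes scenarios are async callables returning pass/fail; rai_bench's real scenario interface is not shown in this PR, so run_with_timeout, ScenarioResult and the timeout constants below are hypothetical. asyncio.wait_for cancels a scenario that exceeds its limit, so one hung scenario is recorded as a failure instead of stalling the whole benchmark.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, List, Tuple

# Hypothetical constants mirroring the limits described in this PR.
MANIPULATION_SCENARIO_TIMEOUT_S = 210.0
TOOL_CALLING_SCENARIO_TIMEOUT_S = 60.0

Scenario = Callable[[], Awaitable[bool]]  # assumed shape: async, returns pass/fail


@dataclass
class ScenarioResult:
    name: str
    passed: bool
    timed_out: bool


async def run_with_timeout(name: str, scenario: Scenario, timeout_s: float) -> ScenarioResult:
    """Run one scenario; if it exceeds timeout_s it is cancelled and counted as failed."""
    try:
        passed = await asyncio.wait_for(scenario(), timeout=timeout_s)
        return ScenarioResult(name, passed, timed_out=False)
    except asyncio.TimeoutError:
        return ScenarioResult(name, passed=False, timed_out=True)


async def run_all(scenarios: List[Tuple[str, Scenario]], timeout_s: float) -> List[ScenarioResult]:
    """Run scenarios sequentially, each with its own time budget."""
    results: List[ScenarioResult] = []
    for name, scenario in scenarios:
        results.append(await run_with_timeout(name, scenario, timeout_s))
    return results


async def quick_scenario() -> bool:
    await asyncio.sleep(0.01)
    return True


async def stuck_scenario() -> bool:
    await asyncio.sleep(10)  # simulates an agent that never finishes
    return True


if __name__ == "__main__":
    # Tiny timeout so the demo finishes fast; real runs would use the constants above.
    demo = [("quick", quick_scenario), ("stuck", stuck_scenario)]
    for result in asyncio.run(run_all(demo, timeout_s=0.1)):
        print(result)
```

Catching asyncio.TimeoutError works on both older Python versions and 3.11+, where it is an alias of the built-in TimeoutError.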
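The idea behind test_models, one entry point driven by benchmark configs, can be sketched as follows. Everything here is hypothetical: BenchmarkConfig, run_models_across_benchmarks and the runner callables are illustrative stand-ins, not the actual signatures in rai_bench/test_models.py. Each config gets its own run index, mirroring the 1 run dir = 1 benchmark layout, so the same benchmark run twice with different params yields separate result keys.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class BenchmarkConfig:
    """Hypothetical config: which benchmark to run and with what params."""
    name: str
    params: Dict[str, object] = field(default_factory=dict)


# Hypothetical runner signature: (model_name, params) -> score for that run.
Runner = Callable[[str, Dict[str, object]], float]
ResultKey = Tuple[int, str, str]  # (run index, benchmark name, model name)


def run_models_across_benchmarks(
    model_names: List[str],
    configs: List[BenchmarkConfig],
    runners: Dict[str, Runner],
) -> Dict[ResultKey, float]:
    """Run every model on every configured benchmark and gather scores in one dict."""
    results: Dict[ResultKey, float] = {}
    for run_index, config in enumerate(configs):
        runner = runners[config.name]  # raises KeyError if no runner is registered
        for model in model_names:
            # The run index keeps repeated benchmarks with different params apart.
            results[(run_index, config.name, model)] = runner(model, config.params)
    return results


if __name__ == "__main__":
    # Stand-in runner: pretends the score depends only on a "difficulty" param.
    def fake_tool_calling(model: str, params: Dict[str, object]) -> float:
        return 1.0 / float(params.get("difficulty", 1))

    configs = [
        BenchmarkConfig("tool_calling", {"difficulty": 1}),
        BenchmarkConfig("tool_calling", {"difficulty": 2}),
    ]
    scores = run_models_across_benchmarks(
        ["model-a", "model-b"], configs, {"tool_calling": fake_tool_calling}
    )
    for key, score in sorted(scores.items()):
        print(key, score)
```

Keeping the loop over configs outermost means each config maps naturally onto its own run directory.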
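Selecting several runs, concatenating their results and sorting them by date can be sketched with the standard library alone. This assumes a hypothetical layout in which each run dir is named by its start time (RUN_DIR_FORMAT) and holds a results.csv; neither the naming scheme nor the file name is taken from rai_bench, and list_runs / concat_runs are illustrative helpers, not the visualise script's functions.

```python
import csv
from datetime import datetime
from pathlib import Path
from typing import Dict, List, Optional, Tuple

# Hypothetical layout: <output_root>/<YYYY-mm-dd_HH-MM-SS>/results.csv, one dir per run.
RUN_DIR_FORMAT = "%Y-%m-%d_%H-%M-%S"
RESULTS_FILE = "results.csv"


def run_date(run_dir: Path) -> Optional[datetime]:
    """Parse a run's start time from its dir name; None if the name doesn't match."""
    try:
        return datetime.strptime(run_dir.name, RUN_DIR_FORMAT)
    except ValueError:
        return None


def list_runs(output_root: Path) -> List[Path]:
    """Return run dirs that contain results, newest first."""
    dated: List[Tuple[datetime, Path]] = []
    for d in output_root.iterdir():
        when = run_date(d)
        if d.is_dir() and when is not None and (d / RESULTS_FILE).is_file():
            dated.append((when, d))
    dated.sort(key=lambda item: item[0], reverse=True)
    return [d for _, d in dated]


def concat_runs(selected: List[Path]) -> List[Dict[str, str]]:
    """Concatenate rows from the selected runs, tagging each row with its run name."""
    rows: List[Dict[str, str]] = []
    for run_dir in selected:
        with open(run_dir / RESULTS_FILE, newline="") as f:
            for row in csv.DictReader(f):
                row["run"] = run_dir.name  # keeps runs distinguishable after merging
                rows.append(row)
    return rows
```

A UI would typically show list_runs(root) to the user, let them pick a subset, and pass that subset to concat_runs.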
Issues
#526
#462
Testing