Any speed up method since only run pick_coke_can_variant_agg.sh takes 40 mins #83
Comments
It's quite strange. It looks like the GPU might be unused during eval (pure CPU eval is very slow). What GPU are you using, and what's the GPU utilization? You need to keep all scenes to get the variant avg success metric.
Hi, it is
Do you have the correct tensorflow version?
Also, if you are evaluating Octo, see https://github.com/simpler-env/SimplerEnv?tab=readme-ov-file#octo-inference-setup
If the model is not using the GPU, a warning will be raised.
Thank you! Do you have any advice on the torch version? I am facing this problem #30 (comment). I am using torch 2.2.0, but it is incompatible with the CUDA libs pulled in by the tensorflow[and-cuda]==2.15.1 dependency.
I don't think any part of the existing code uses torch? If so, you can just install CPU torch, I think. I locally have torch 2.3.1, but I haven't tested it in a long time.
Thank you. torch 2.3.1 works fine for my OpenVLA! I am using tensorflow 2.15.1. I think most of the time is spent in rendering.
All policies take RGB images as input, so RGB images are always rendered. If graphics were the problem, then I think ffmpeg might be causing the issue. In that case (as for basically all ffmpeg-caused slow video-saving issues), it would mean there is a lack of system memory (and if you use a cluster, other jobs might be taking too much memory).
Generally, on a 4090, a single pick-coke-can episode takes about 5 seconds for RT-1.
Pick coke can uses rasterization, so there's no ray tracing, and the env shouldn't be slow on non-RTX GPUs; you can benchmark the time for each component to see which is the slowest.
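A minimal sketch of per-component benchmarking, assuming a standard rollout loop (the names `model.step` and `env.step` mirror the log lines later in this thread; the actual loop lives in SimplerEnv's eval scripts):

```python
import time
from collections import defaultdict

class StageTimer:
    """Accumulate wall-clock time per pipeline stage (policy forward, env step, video save, ...)."""
    def __init__(self):
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    def measure(self, stage):
        timer = self
        class _Ctx:
            def __enter__(self):
                self.t0 = time.perf_counter()
            def __exit__(self, *exc):
                timer.totals[stage] += time.perf_counter() - self.t0
                timer.counts[stage] += 1
                return False
        return _Ctx()

    def report(self):
        # {stage: (total_seconds, call_count)}
        return {s: (self.totals[s], self.counts[s]) for s in self.totals}

# Usage inside the rollout loop (model/env are placeholders):
# with timer.measure("policy"):
#     action = model.step(obs)
# with timer.measure("env"):
#     obs, reward, done, info = env.step(action)
```

Dividing each stage's total by its count gives the per-step averages quoted later in the thread.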
My memory should be fine: 215 GB / 756 GB used. Do you mean there is no option to skip producing videos?
If so, then ffmpeg shouldn't be the bottleneck. It might take several seconds to save 11 videos.
Sorry, I forgot to mention an important thing: every one of my episodes fails, so it runs to the max step count. Maybe that is why it is slow?
model.step Step time: 0.039
Env step time: 0.048
one episode Time taken: 10.73 on H100
Env steps (max 75 steps per episode): 0.039 * 75 < 3 s, so I think the policy forward takes ~7 s; you can check.
The total per-step time is 0.039 + 0.048 = 0.087 s. Total timesteps: 80, and 0.087 * 80 = 6.96 s.
But it doesn't add up to 10 s?
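The back-of-envelope numbers from the thread can be checked directly; the gap between stepping time and the 10.73 s episode time is what the setup and video-saving parts must account for:

```python
model_step = 0.039   # s, measured policy forward per step (from the log above)
env_step   = 0.048   # s, measured env step per step
steps      = 80      # timesteps actually taken this episode
episode    = 10.73   # s, measured total episode time on H100

stepping = (model_step + env_step) * steps
unaccounted = episode - stepping

print(f"stepping time: {stepping:.2f} s")     # 6.96 s
print(f"unaccounted:   {unaccounted:.2f} s")  # ~3.8 s: env setup + saving
```

The unaccounted ~3.8 s matches the prepare (~1.8 s) and save (~0.85 s) parts measured below, plus per-step overhead outside the two timed calls.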
The first (preparation) part takes 1.76-1.91 s. The second part (stepping) takes ~7.5 s. The third part (saving video and actions) takes around 0.85 s, so in total roughly 1.76 + 7.5 + 0.85 = 10.11 s.
Yeah, for the same env, the 1.76 s can be saved by not re-creating the env and just resetting it with a different robot & object pose; but the majority of the time is still ~50% policy forward and ~50% env step.
Thank you for your time! I will keep thinking |
Essentially the way to speed up both model inference & env is via parallelizing envs; it's already done in ManiSkill3 for widowx envs but not yet for google robot envs. |
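The intuition behind parallelizing envs: one batched policy forward over N observations costs far less than N separate forwards, so per-episode policy time is amortized across the batch. A toy numpy sketch (the matrix stands in for a real policy; the shapes are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 7))           # toy "policy": 512-dim obs -> 7-dim action
obs_batch = rng.standard_normal((16, 512))  # observations from 16 parallel envs

# One batched forward produces actions for all 16 envs at once,
# instead of 16 separate single-observation forwards.
actions = obs_batch @ W
```

The same amortization applies to env stepping when the simulator itself steps many scenes in one batched call, as ManiSkill3 does for the widowx envs.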
Hi @xuanlinli17 @LukeLIN-web, for example, if an episode already succeeds, can we end it early instead of running to the step limit?
Yes, you can modify the evaluation code to return early. |
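A sketch of what "return early" could look like, assuming a gymnasium-style step API; the `success` key in `info` is an assumption here, so check what the SimplerEnv envs actually report:

```python
def run_episode(env, policy, max_steps=80):
    """Roll out one episode; return (success, steps_taken), stopping as soon as success is signalled."""
    obs, info = env.reset()
    steps = 0
    for _ in range(max_steps):
        action = policy(obs)
        obs, reward, terminated, truncated, info = env.step(action)
        steps += 1
        if info.get("success", False):
            return True, steps        # no need to run out the step budget
        if terminated or truncated:
            break                     # episode ended without success
    return False, steps

class _ToyEnv:
    """Minimal stand-in env for illustration: signals success on step 3."""
    def reset(self):
        self.t = 0
        return 0, {}
    def step(self, action):
        self.t += 1
        done = self.t >= 3
        return 0, 0.0, done, False, {"success": done}

ok, steps = run_episode(_ToyEnv(), policy=lambda obs: 0)
```

Note this only helps successful episodes; the all-fail case described above still runs every episode to `max_steps`.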
Got it. Thanks! |
great idea! |
Also, we currently have to save videos, otherwise we cannot compute the metrics. I evaluated the Google Robot envs.
It took me 16 hours on an A6000 to eval the Google Robot envs, really sad. And it takes 9.0 GB to store the generated MP4s.
Is it OK to parallelize it? I am worried there may be issues when running multiple instances on one machine. (I have experience with other software where running multiple instances triggers bugs.)
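In the meantime, process-level parallelism is usually safe because each eval runs in its own process, as long as every instance writes to its own results/video directory. A sketch; the idea of passing a variant to `pick_coke_can_variant_agg.sh` is hypothetical (in practice you would split the work however the scripts allow, e.g. one script per variant):

```python
import subprocess

# One process per variant, so the instances never share mutable state.
variants = ["upright=True", "lr_switch=True", "laid_vertically=True"]

def launch(variant, dry_run=True):
    # Hypothetical invocation; adapt to how the eval scripts take arguments.
    cmd = ["bash", "scripts/pick_coke_can_variant_agg.sh", variant]
    if dry_run:
        return " ".join(cmd)      # sketch only: show the command
    return subprocess.Popen(cmd)  # real run: launch in parallel

cmds = [launch(v) for v in variants]
```

GPU memory is the main practical constraint: each instance loads its own copy of the policy, so check that N copies fit before launching N processes.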
Bridge envs are parallelized in ManiSkill3. Google Robot envs tbd. |
I don't use all coke_can_options_arr
# declare -a coke_can_options_arr=("lr_switch=True" "upright=True" "laid_vertically=True")
only
declare -a coke_can_options_arr=("upright=True")
But it still takes 40 mins. Is there any way to speed it up? Or is there a smaller test case?
I tried to only run the first block
scene_name=google_pick_coke_can_1_v4
and not run the others like declare -a scene_arr=("Baked_sc1_staging_objaverse_cabinet1_h870"
But I cannot get standing sim variant avg success. How can I get standing sim variant avg success when I don't run all blocks?
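For reference, a variant-aggregated metric is just an average over the per-scene success rates, which is why it is undefined when scene blocks are skipped. A toy sketch with made-up numbers (the scene names come from the script above; the rates are fabricated for illustration):

```python
# Hypothetical per-scene success rates for the standing ("upright") variant.
# The aggregated metric averages over ALL scene blocks, so every scene
# in the variant must be evaluated before it can be computed.
scene_success = {
    "google_pick_coke_can_1_v4": 0.80,
    "Baked_sc1_staging_objaverse_cabinet1_h870": 0.60,
}

variant_avg = sum(scene_success.values()) / len(scene_success)
print(round(variant_avg, 2))  # 0.7
```

Running only `google_pick_coke_can_1_v4` gives that scene's success rate, but not the cross-scene variant average.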