Any speed-up method? Running only pick_coke_can_variant_agg.sh takes 40 mins #83

Open
LukeLIN-web opened this issue Apr 11, 2025 · 28 comments

Comments

@LukeLIN-web

LukeLIN-web commented Apr 11, 2025

I don't use all of coke_can_options_arr, i.e. declare -a coke_can_options_arr=("lr_switch=True" "upright=True" "laid_vertically=True");
I only use declare -a coke_can_options_arr=("upright=True").
But it still takes 40 mins. Is there any way to speed it up? Or is there a smaller test case?

I also tried running only the first block (scene_name=google_pick_coke_can_1_v4) and not running the others such as declare -a scene_arr=("Baked_sc1_staging_objaverse_cabinet1_h870" ...), but then I cannot get the standing sim variant avg success.
How can I get the standing sim variant avg success when I don't run all the blocks?

@xuanlinli17
Collaborator

xuanlinli17 commented Apr 11, 2025

It's quite strange. It looks like the GPU might be unused during eval (pure CPU eval is very slow). What GPU are you using, and what's the GPU utilization?

You need to keep all scenes to get standing sim variant avg success.

@LukeLIN-web
Author

It's quite strange. It looks like the GPU might be unused during eval (pure CPU eval is very slow). What GPU are you using, and what's the GPU utilization?

You need to keep all scenes to get standing sim variant avg success.

Hi, here is the per-process GPU usage:

Compute  0%   16214MiB  17%  0%  3870MiB  python simpler_env/main_inference.py
Graphic  11%  16214MiB  17%  0%  3931MiB  python simpler_env/main_inference.py --policy-model

Maybe they are not using the GPU for compute. I am using an H100.

@xuanlinli17
Collaborator

Do you have the correct tensorflow version?

pip install tensorflow==2.15.0
pip install -r requirements_full_install.txt
pip install tensorflow[and-cuda]==2.15.1 # tensorflow gpu support
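
As a quick sanity check (a minimal sketch, not part of the repo scripts), you can confirm that TensorFlow actually sees the GPU before launching eval:

import tensorflow as tf

# An empty GPU list here means inference will silently fall back to the CPU.
print(tf.__version__)
print(tf.config.list_physical_devices("GPU"))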

@xuanlinli17
Collaborator

If the model is not using the GPU, a warning will be raised.

@LukeLIN-web
Author

LukeLIN-web commented Apr 12, 2025

Do you have the correct tensorflow version?

pip install tensorflow==2.15.0
pip install -r requirements_full_install.txt
pip install tensorflow[and-cuda]==2.15.1 # tensorflow gpu support

Thank you! Do you have any advice on the torch version? I am facing this problem: #30 (comment). I am using torch 2.2.0, but it is incompatible with the CUDA libraries that tensorflow[and-cuda]==2.15.1 depends on.

@xuanlinli17
Collaborator

I don't think any part of the existing code uses torch? If so, you can just install CPU torch, I think. I locally have torch 2.3.1, but I haven't tested it in a long time.

@LukeLIN-web
Author

LukeLIN-web commented Apr 12, 2025

I don't think any part of the existing code uses torch? If so, you can just install CPU torch, I think. I locally have torch 2.3.1, but I haven't tested it in a long time.

Thank you, torch 2.3.1 works fine for my OpenVLA! I am using tensorflow 2.15.1.
But the speed is still slow: about three minutes for 11 runs of run_maniskill2_eval_single_episode.

pick coke can
OrderedDict([('n_lift_significant', 0), ('consec_grasp', False), ('grasped', False)])
... (the same output is printed for all 11 episodes, all False)
pick coke can
OrderedDict([('n_lift_significant', 0), ('consec_grasp', False), ('grasped', False)])

I think the time is going into rendering; graphics takes 3-4x longer than compute. I will try not producing videos.

@xuanlinli17
Collaborator

xuanlinli17 commented Apr 12, 2025

All policies take RGB images as input, so RGB images are always rendered. If graphics is the problem, then I think ffmpeg might be causing the issue. In that case (as for basically all ffmpeg-caused slow video-saving issues), it means there is a lack of system memory (and if you use a cluster, other jobs might be taking too much memory).

@xuanlinli17
Collaborator

xuanlinli17 commented Apr 12, 2025

Generally, on a 4090 a single episode of pick coke can takes about 5 seconds for RT-1.

@xuanlinli17
Collaborator

Pick coke can uses rasterization (no ray tracing), so the env shouldn't be slow even on non-RTX GPUs; you can try benchmarking the time for each component to see which is the slowest.

@LukeLIN-web
Author

All policies take RGB images as input, so RGB images are always rendered. If graphics is the problem, then I think ffmpeg might be causing the issue. In that case (as for basically all ffmpeg-caused slow video-saving issues), it means there is a lack of system memory (and if you use a cluster, other jobs might be taking too much memory).

My memory should be fine: 215 GB used out of 756 GB. Did you mean there is no option to skip producing videos?

@xuanlinli17
Collaborator

xuanlinli17 commented Apr 12, 2025

If so then ffmpeg shouldn't be the bottleneck. It might take several seconds to save 11 videos.

@LukeLIN-web
Author

LukeLIN-web commented Apr 12, 2025

Sorry, I forgot to mention an important thing: maybe it is because every episode of mine fails, so it always reaches the max number of steps, and that is why it is slow?

        raw_action, action = model.step(image, task_description)
        predicted_actions.append(raw_action)
        predicted_terminated = bool(action["terminate_episode"][0] > 0)
        if predicted_terminated:
            if not is_final_subtask:
                # advance the environment to the next subtask
                predicted_terminated = False
                env.advance_to_next_subtask()

model.step Step time: 0.039

        env_start_time = time.time()
        # step the environment
        obs, reward, done, truncated, info = env.step(
            np.concatenate(
                [action["world_vector"], action["rot_axangle"], action["gripper"]]
            ),
        )
        env_end_time = time.time()
        print(f"Env step time: {env_end_time - env_start_time}")

Env step time: 0.048


    video_start_time = time.time()
    for k, v in additional_env_build_kwargs.items():
        env_save_name = env_save_name + f"_{k}_{v}"
    .... 
    video_path = os.path.join(logging_dir, video_path)
    write_video(video_path, images, fps=5)
    video_end_time = time.time()
    print(f"Video write time: {video_end_time - video_start_time}")

    action_start_time = time.time()
    # save action trajectory
    action_path = video_path.replace(".mp4", ".png")
    action_root = os.path.dirname(action_path) + "/actions/"
    os.makedirs(action_root, exist_ok=True)
    action_path = action_root + os.path.basename(action_path)
    model.visualize_epoch(predicted_actions, images, save_path=action_path)
    action_end_time = time.time()
    print(f"Action save time: {action_end_time - action_start_time}")

Video write time: 0.2038
Action save time: 0.659

One episode takes 10.73 s in total on the H100.

@xuanlinli17
Collaborator

xuanlinli17 commented Apr 12, 2025

Env step (max 75 steps per episode): 0.039 * 75 < 3 s, so I think the policy forward takes ~7 s; you can check.

@LukeLIN-web
Author

LukeLIN-web commented Apr 12, 2025

Env step (max 75 steps per episode): 0.039 * 75 < 3 s, so I think the policy forward takes ~7 s; you can check.

The total while not (predicted_terminated or truncated): loop takes 7.29-8.37 s.
model.step() takes 0.039 s/step.
env.step() takes 0.048 s/step.

0.039 + 0.048 = 0.087

Total timesteps: 80, and 0.087 * 80 = 6.96 s.

@xuanlinli17
Collaborator

But it doesn't add up to 10s?

@LukeLIN-web
Author

LukeLIN-web commented Apr 12, 2025

The first (preparation) part takes 1.76-1.91 s:

    if additional_env_build_kwargs is None:
        additional_env_build_kwargs = {}

    # Create environment
  ..... 
    # Initialize logging
    image = get_image_from_maniskill2_obs_dict(env, obs, camera_name=obs_camera_name)  # can we use the wrist camera here?
    images = [image]
    predicted_actions = []
    predicted_terminated, done, truncated = False, False, False

    # Initialize model
    model.reset(task_description)

    timestep = 0
    success = "failure"

Second part: the while not (predicted_terminated or truncated): loop takes 7.29-8.37 s.

Third part: saving the video and actions takes around 0.85 s.

So in total, around 1.76 + 7.5 + 0.85 = 10.11 s.

@xuanlinli17
Collaborator

Yeah, for the same env the 1.76 s can be saved by not re-creating the env and just resetting it with a different robot & object pose; but the majority of the time is still ~50% policy forward and ~50% env step.
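
A minimal sketch of that idea (assuming simpler_env.make from this repo and the gymnasium-style reset; whether a plain re-seeded reset reproduces exactly the pose randomization used in the eval scripts is an assumption):

import simpler_env  # env construction helper from this repo

# Hypothetical cache: build each task's env once and only reset between episodes,
# so the ~1.8 s scene-construction cost is paid a single time per task.
_env_cache = {}

def get_env(task_name):
    if task_name not in _env_cache:
        _env_cache[task_name] = simpler_env.make(task_name)
    return _env_cache[task_name]

env = get_env("google_robot_pick_coke_can")
for episode_id in range(11):
    # gymnasium-style reset; a new seed gives a new robot/object pose without rebuilding the scene
    obs, reset_info = env.reset(seed=episode_id)
    # ... run the policy loop as in run_maniskill2_eval_single_episode ...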

@LukeLIN-web
Author

Yeah, for the same env the 1.76 s can be saved by not re-creating the env and just resetting it with a different robot & object pose; but the majority of the time is still ~50% policy forward and ~50% env step.

Thank you for your time! I will keep thinking

@xuanlinli17
Collaborator

Essentially, the way to speed up both model inference and the env is to parallelize the envs; this is already done in ManiSkill3 for the WidowX envs, but not yet for the Google Robot envs.
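
Purely as an illustration of that idea (this is not the ManiSkill3 implementation, and whether the Google Robot envs behave correctly under subprocess vectorization is untested), a sketch using gymnasium's generic vector API:

import gymnasium as gym
import simpler_env  # assumes the simpler_env package from this repo is installed

# Hypothetical sketch: N worker processes each own a copy of the env, so one round of
# vec_env.step() advances N episodes, and the policy can batch its forward pass over
# the N observations, amortizing both simulation and inference time.
N = 8
vec_env = gym.vector.AsyncVectorEnv(
    [lambda: simpler_env.make("google_robot_pick_coke_can") for _ in range(N)]
)
obs, infos = vec_env.reset(seed=list(range(N)))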

@jasper0314-huang

Hi @xuanlinli17 @LukeLIN-web,
I have a question that might be stupid: can we return early from an episode once success has been achieved?

For example, if success becomes True at the 12th step, all the remaining steps seem unnecessary.
Would it be okay to just break out of the loop at that point?

[screenshot attached]

@xuanlinli17
Collaborator

Yes, you can modify the evaluation code to return early.
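
For example, a minimal sketch of an early-exit episode loop (abbreviated: it drops the subtask-advance and logging logic quoted earlier in this thread, and it assumes the env's info dict exposes a success flag, as in the screenshot above):

import numpy as np
from simpler_env.utils.env.observation_utils import get_image_from_maniskill2_obs_dict

def run_episode_with_early_exit(env, model, obs, task_description, obs_camera_name=None):
    # Abbreviated episode loop that stops as soon as the env reports success.
    image = get_image_from_maniskill2_obs_dict(env, obs, camera_name=obs_camera_name)
    images, success, truncated = [image], "failure", False
    while not truncated:
        raw_action, action = model.step(image, task_description)
        obs, reward, done, truncated, info = env.step(
            np.concatenate([action["world_vector"], action["rot_axangle"], action["gripper"]])
        )
        image = get_image_from_maniskill2_obs_dict(env, obs, camera_name=obs_camera_name)
        images.append(image)
        if info.get("success", False):  # assumed flag; stop instead of running to max steps
            success = "success"
            break
    return success, images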

@jasper0314-huang

Got it. Thanks!

@LukeLIN-web
Author

Hi @xuanlinli17 @LukeLIN-web, I have a question that might be stupid: can we return early from an episode once success has been achieved?

For example, if success becomes True at the 12th step, all the remaining steps seem unnecessary. Would it be okay to just break out of the loop at that point?

[screenshot attached]

Great idea!

@LukeLIN-web
Author

LukeLIN-web commented Apr 18, 2025

saving the video and actions takes around 0.85 s

And we have to save the videos, otherwise the metrics cannot be counted right now.

I evaluated the Google Robot tasks:

tasks = [
    "pick_coke_can_visual_matching.sh",
    "pick_coke_can_variant_agg.sh",
    "move_near_variant_agg.sh",
    "move_near_visual_matching.sh",
    "drawer_visual_matching.sh",
    "drawer_variant_agg.sh",
]

It took me 16 hours on an A6000 to evaluate the Google Robot tasks, which is really sad.

And it takes 9.0 GB to store the generated MP4 videos.

@Boltzmachine

Is it OK to parallelize it? I am worried that there may be issues when running multiple instances on one machine. (I have had experience with other software where running multiple instances causes bugs.)

@xuanlinli17
Collaborator

Bridge envs are parallelized in ManiSkill3. Google Robot envs tbd.
