Any speed-up method? Running only pick_coke_can_variant_agg.sh takes 40 mins #83

Open
LukeLIN-web opened this issue Apr 11, 2025 · 28 comments

Comments

@LukeLIN-web

LukeLIN-web commented Apr 11, 2025

I don't use all of coke_can_options_arr, i.e. declare -a coke_can_options_arr=("lr_switch=True" "upright=True" "laid_vertically=True");
I only use declare -a coke_can_options_arr=("upright=True").
But it still takes 40 mins. Is there any way to speed it up? Or is there a smaller test case?

I also tried running only the first block (scene_name=google_pick_coke_can_1_v4) and not running the others such as declare -a scene_arr=("Baked_sc1_staging_objaverse_cabinet1_h870" ...), but then I cannot get the standing sim variant avg success.
How can I get the standing sim variant avg success when I don't run all the blocks?

@xuanlinli17
Collaborator

xuanlinli17 commented Apr 11, 2025

It's quite strange. It looks like the GPU might be unused during eval (pure CPU eval is very slow). What GPU are you using, and what's the GPU utilization?

You need to keep all scenes to get standing sim variant avg success.

@LukeLIN-web
Author

It's quite strange. It looks like the GPU might be unused during eval (pure CPU eval is very slow). What GPU are you using, and what's the GPU utilization?

You need to keep all scenes to get standing sim variant avg success.

Hi, here is the per-process GPU usage:

Compute  0%   16214MiB  17%  0%  3870MiB  python simpler_env/main_inference.py
Graphic  11%  16214MiB  17%  0%  3931MiB  python simpler_env/main_inference.py --policy-model

Maybe they are not using the GPU for compute. I am using an H100.

@xuanlinli17
Collaborator

Do you have the correct tensorflow version?

pip install tensorflow==2.15.0
pip install -r requirements_full_install.txt
pip install tensorflow[and-cuda]==2.15.1 # tensorflow gpu support
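
As a quick sanity check (a minimal sketch, not part of the repo scripts), you can confirm that TensorFlow actually sees the GPU before launching eval:

import tensorflow as tf

# An empty GPU list here means inference will silently fall back to the CPU.
print(tf.__version__)
print(tf.config.list_physical_devices("GPU"))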

@xuanlinli17
Collaborator

If the model is not using the GPU, a warning will be raised.

@LukeLIN-web
Author

LukeLIN-web commented Apr 12, 2025

Do you have the correct tensorflow version?

pip install tensorflow==2.15.0
pip install -r requirements_full_install.txt
pip install tensorflow[and-cuda]==2.15.1 # tensorflow gpu support

Thank you! Do you have any advice on the torch version? I am facing this problem: #30 (comment). I am using torch 2.2.0, but it is incompatible with the CUDA libraries that tensorflow[and-cuda]==2.15.1 depends on.

@xuanlinli17
Collaborator

I don't think any part of the existing code uses torch? If so, you can just install CPU torch, I think. I locally have torch 2.3.1, but I haven't tested it in a long time.

@LukeLIN-web
Author

LukeLIN-web commented Apr 12, 2025

I don't think any part of the existing code uses torch? If so, you can just install CPU torch, I think. I locally have torch 2.3.1, but I haven't tested it in a long time.

Thank you, torch 2.3.1 works fine for my OpenVLA! I am using tensorflow 2.15.1.
But the speed is still slow: about three minutes for 11 runs of run_maniskill2_eval_single_episode.

pick coke can
OrderedDict([('n_lift_significant', 0), ('consec_grasp', False), ('grasped', False)])
... (the same output is printed for all 11 episodes, all False)
pick coke can
OrderedDict([('n_lift_significant', 0), ('consec_grasp', False), ('grasped', False)])

I think the time is going into rendering; graphics takes 3-4x longer than compute. I will try not producing videos.

@xuanlinli17
Collaborator

xuanlinli17 commented Apr 12, 2025

All policies take RGB images as input, so RGB images are always rendered. If graphics is the problem, then I think ffmpeg might be causing the issue. In that case (as for basically all ffmpeg-caused slow video-saving issues), it means there is a lack of system memory (and if you use a cluster, other jobs might be taking too much memory).

@xuanlinli17
Collaborator

xuanlinli17 commented Apr 12, 2025

Generally, on a 4090 a single episode of pick coke can takes about 5 seconds for RT-1.

@xuanlinli17
Collaborator

Pick coke can uses rasterization (no ray tracing), so the env shouldn't be slow even on non-RTX GPUs; you can try benchmarking the time for each component to see which is the slowest.

@LukeLIN-web
Author

All policies take RGB images as input, so RGB images are always rendered. If graphics is the problem, then I think ffmpeg might be causing the issue. In that case (as for basically all ffmpeg-caused slow video-saving issues), it means there is a lack of system memory (and if you use a cluster, other jobs might be taking too much memory).

My memory should be fine: 215 GB used out of 756 GB. Did you mean there is no option to skip producing videos?

@xuanlinli17
Collaborator

xuanlinli17 commented Apr 12, 2025

If so then ffmpeg shouldn't be the bottleneck. It might take several seconds to save 11 videos.

@LukeLIN-web
Author

LukeLIN-web commented Apr 12, 2025

Sorry, I forgot to mention an important thing: maybe it is because every episode of mine fails, so it always reaches the max number of steps, and that is why it is slow?

        raw_action, action = model.step(image, task_description)
        predicted_actions.append(raw_action)
        predicted_terminated = bool(action["terminate_episode"][0] > 0)
        if predicted_terminated:
            if not is_final_subtask:
                # advance the environment to the next subtask
                predicted_terminated = False
                env.advance_to_next_subtask()

model.step Step time: 0.039

        env_start_time = time.time()
        # step the environment
        obs, reward, done, truncated, info = env.step(
            np.concatenate(
                [action["world_vector"], action["rot_axangle"], action["gripper"]]
            ),
        )
        env_end_time = time.time()
        print(f"Env step time: {env_end_time - env_start_time}")

Env step time: 0.048


    video_start_time = time.time()
    for k, v in additional_env_build_kwargs.items():
        env_save_name = env_save_name + f"_{k}_{v}"
    .... 
    video_path = os.path.join(logging_dir, video_path)
    write_video(video_path, images, fps=5)
    video_end_time = time.time()
    print(f"Video write time: {video_end_time - video_start_time}")

    action_start_time = time.time()
    # save action trajectory
    action_path = video_path.replace(".mp4", ".png")
    action_root = os.path.dirname(action_path) + "/actions/"
    os.makedirs(action_root, exist_ok=True)
    action_path = action_root + os.path.basename(action_path)
    model.visualize_epoch(predicted_actions, images, save_path=action_path)
    action_end_time = time.time()
    print(f"Action save time: {action_end_time - action_start_time}")

Video write time: 0.2038
Action save time: 0.659

One episode takes 10.73 s in total on the H100.

@xuanlinli17
Collaborator

xuanlinli17 commented Apr 12, 2025

Env step (max 75 steps per episode): 0.039 * 75 < 3 s, so I think the policy forward takes ~7 s; you can check.

@LukeLIN-web
Author

LukeLIN-web commented Apr 12, 2025

Env step (max 75 steps per episode): 0.039 * 75 < 3 s, so I think the policy forward takes ~7 s; you can check.

The total while not (predicted_terminated or truncated): loop takes 7.29-8.37 s.
model.step() takes 0.039 s/step.
env.step() takes 0.048 s/step.

0.039 + 0.048 = 0.087

Total timesteps: 80, and 0.087 * 80 = 6.96 s.

@xuanlinli17
Collaborator

But it doesn't add up to 10s?

@LukeLIN-web
Author

LukeLIN-web commented Apr 12, 2025

The first (preparation) part takes 1.76-1.91 s:

    if additional_env_build_kwargs is None:
        additional_env_build_kwargs = {}

    # Create environment
  ..... 
    # Initialize logging
    image = get_image_from_maniskill2_obs_dict(env, obs, camera_name=obs_camera_name)  # can we use the wrist camera here?
    images = [image]
    predicted_actions = []
    predicted_terminated, done, truncated = False, False, False

    # Initialize model
    model.reset(task_description)

    timestep = 0
    success = "failure"

Second part: the while not (predicted_terminated or truncated): loop takes 7.29-8.37 s.

Third part: saving the video and actions takes around 0.85 s.

So in total, around 1.76 + 7.5 + 0.85 = 10.11 s.

@xuanlinli17
Collaborator

Yeah, for the same env the 1.76 s can be saved by not re-creating the env and just resetting it with a different robot & object pose; but the majority of the time is still ~50% policy forward and ~50% env step.
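
A minimal sketch of that idea (assuming simpler_env.make from this repo and the gymnasium-style reset; whether a plain re-seeded reset reproduces exactly the pose randomization used in the eval scripts is an assumption):

import simpler_env  # env construction helper from this repo

# Hypothetical cache: build each task's env once and only reset between episodes,
# so the ~1.8 s scene-construction cost is paid a single time per task.
_env_cache = {}

def get_env(task_name):
    if task_name not in _env_cache:
        _env_cache[task_name] = simpler_env.make(task_name)
    return _env_cache[task_name]

env = get_env("google_robot_pick_coke_can")
for episode_id in range(11):
    # gymnasium-style reset; a new seed gives a new robot/object pose without rebuilding the scene
    obs, reset_info = env.reset(seed=episode_id)
    # ... run the policy loop as in run_maniskill2_eval_single_episode ...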

@LukeLIN-web
Author

Yeah, for the same env the 1.76 s can be saved by not re-creating the env and just resetting it with a different robot & object pose; but the majority of the time is still ~50% policy forward and ~50% env step.

Thank you for your time! I will keep thinking

@xuanlinli17
Collaborator

Essentially, the way to speed up both model inference and the env is to parallelize the envs; this is already done in ManiSkill3 for the WidowX envs, but not yet for the Google Robot envs.
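
Purely as an illustration of that idea (this is not the ManiSkill3 implementation, and whether the Google Robot envs behave correctly under subprocess vectorization is untested), a sketch using gymnasium's generic vector API:

import gymnasium as gym
import simpler_env  # assumes the simpler_env package from this repo is installed

# Hypothetical sketch: N worker processes each own a copy of the env, so one round of
# vec_env.step() advances N episodes, and the policy can batch its forward pass over
# the N observations, amortizing both simulation and inference time.
N = 8
vec_env = gym.vector.AsyncVectorEnv(
    [lambda: simpler_env.make("google_robot_pick_coke_can") for _ in range(N)]
)
obs, infos = vec_env.reset(seed=list(range(N)))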

@jasper0314-huang

Hi @xuanlinli17 @LukeLIN-web,
I have a question that might be stupid: can we return early from an episode once success has been achieved?

For example, if success becomes True at the 12th step, all the remaining steps seem unnecessary.
Would it be okay to just break out of the loop at that point?

[screenshot attached]

@xuanlinli17
Collaborator

Yes, you can modify the evaluation code to return early.
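
For example, a minimal sketch of an early-exit episode loop (abbreviated: it drops the subtask-advance and logging logic quoted earlier in this thread, and it assumes the env's info dict exposes a success flag, as in the screenshot above):

import numpy as np
from simpler_env.utils.env.observation_utils import get_image_from_maniskill2_obs_dict

def run_episode_with_early_exit(env, model, obs, task_description, obs_camera_name=None):
    # Abbreviated episode loop that stops as soon as the env reports success.
    image = get_image_from_maniskill2_obs_dict(env, obs, camera_name=obs_camera_name)
    images, success, truncated = [image], "failure", False
    while not truncated:
        raw_action, action = model.step(image, task_description)
        obs, reward, done, truncated, info = env.step(
            np.concatenate([action["world_vector"], action["rot_axangle"], action["gripper"]])
        )
        image = get_image_from_maniskill2_obs_dict(env, obs, camera_name=obs_camera_name)
        images.append(image)
        if info.get("success", False):  # assumed flag; stop instead of running to max steps
            success = "success"
            break
    return success, images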

@jasper0314-huang

Got it. Thanks!

@LukeLIN-web
Author

Hi @xuanlinli17 @LukeLIN-web, I have a question that might be stupid: can we return early from an episode once success has been achieved?

For example, if success becomes True at the 12th step, all the remaining steps seem unnecessary. Would it be okay to just break out of the loop at that point?

[screenshot attached]

Great idea!

@LukeLIN-web
Author

LukeLIN-web commented Apr 18, 2025

saving the video and actions takes around 0.85 s

And we have to save the videos, otherwise the metrics cannot be counted right now.

I evaluated the Google Robot tasks:

tasks = [
    "pick_coke_can_visual_matching.sh",
    "pick_coke_can_variant_agg.sh",
    "move_near_variant_agg.sh",
    "move_near_visual_matching.sh",
    "drawer_visual_matching.sh",
    "drawer_variant_agg.sh",
]

It took me 16 hours on an A6000 to evaluate the Google Robot tasks, which is really sad.

And it takes 9.0 GB to store the generated MP4 videos.

@Boltzmachine

Is it OK to parallelize it? I am worried that there may be issues when running multiple instances on one machine. (I have had experience with other software where running multiple instances causes bugs.)

@xuanlinli17
Collaborator

Bridge envs are parallelized in ManiSkill3. Google Robot envs tbd.
