
Potential memory leak on Maniskill3? #87


Open
IrvingF7 opened this issue Apr 28, 2025 · 6 comments

Comments

@IrvingF7

Hi!

Thanks for open-sourcing this awesome project.

Recently, I switched to the maniskill3 branch, and I noticed that I have been getting OOM issues if I switch between too many envs.

My workflow is roughly as follows:

I make env A, run some parallel testing, make sure to close and delete the env by calling env.close() and del env, then make env B, and rinse and repeat, all in a single process.
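
In code, the loop looks roughly like this (a simplified sketch; the task names and num_envs here are placeholders, not my exact setup):

  import gymnasium as gym

  for task_name in ["TaskA-v1", "TaskB-v1"]:  # placeholder task ids
      env = gym.make(task_name, obs_mode="rgb+segmentation", num_envs=16)
      # ... run parallel evaluation episodes on this env ...
      env.close()
      del env
      # expectation: the VRAM held by this env is released here,
      # before the next gym.make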

But I noticed that the VRAM does not drop when an env is closed.

[Image: VRAM usage over time, stepping up each time a new env is created and never dropping when one is closed]

I double-checked the time, and the time at which VRAM increases is indeed the time a new env is made.

The specific error message I got is this:

RuntimeError: Unable to create GPU parallelized camera group. If the error is about being unable to create a buffer, you are likely using too many Cameras. Either use less cameras (via less parallel envs) and/or reduce the size of the cameras. Another common cause is using a memory intensive shader, you can try using the 'minimal' shader which optimizes for GPU memory but disables some advanced functionalities. Another option is to avoid rendering with the rgb_array mode / using the human render cameras as they can be more memory intensive as they typically have higher resolutions for the purposes of visualization.
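
(For reference, the 'minimal' shader the error message suggests would be selected like this via the sensor_configs argument I already pass in my setup; this is just a sketch to illustrate the option.)

  import gymnasium as gym

  # task_name and n_parallel_eval are placeholders for my actual values
  env = gym.make(
      task_name,
      obs_mode="rgb+segmentation",
      num_envs=n_parallel_eval,
      sensor_configs={"shader_pack": "minimal"},  # instead of "default"
  )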

I came across this issue on ManiSkill's main repo, but it seems like the API to manually clear the assets is no longer in the codebase.

@xuanlinli17
Collaborator

@StoneT2000

@StoneT2000
Collaborator

Could you try running import gc; gc.collect() after deleting the environment?
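
i.e. roughly (with env being the environment from the loop above):

  import gc

  env.close()
  del env
  gc.collect()  # force collection of anything still referencing the env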

And what other code are you running besides the environment?

@StoneT2000
Collaborator

And what version of maniskill 3 is being used? git? pypi? nightly?

@IrvingF7
Author

IrvingF7 commented May 2, 2025

Could you try running import gc; gc.collect() after deleting the environment?

And what other code are you running besides the environment?

Hi Stone!

Thanks for the reply. Yes, I did include gc in my code, and that doesn't change the result, which surprises me.

The ManiSkill3 version I was using is 3.0.0b20. I don't quite remember how I installed it, though, sorry.

The code I was running works like this: on the websocket client side, an evaluator object is spawned and queries Simpler/ManiSkill3 to get images and robot states. It then packs this data and sends it to a websocket server.

The server is simply a VLA model that accepts the input, generates actions, and sends them back to the client/evaluator to execute.

Currently, everything runs on a single machine, but the architecture is written so that in the future I can run inference separately from the robot.

The code looks like this:

  def evaluate(self):
      '''Run evaluation on all tasks in the task list'''

      for gradient_step in self.gradient_steps:
          self._initialze_model_client(gradient_step)
          for task_name in self.task_lists:
              self.evaluate_task(task_name)

          if self.use_wandb:
              wandb.log(self.wandb_metrics, step=int(gradient_step), commit=True)

  @override
  def evaluate_task(self, task_name):
      '''
      Evaluate a single task

      Args:
          task_name: Name of the task to evaluate

      Returns:
          success_rate: The success rate achieved on this task
      '''
      start_task_time = time.time()
      task_seed = self.seed
      # Initialize task-specific logging
      task_log_dir = self.log_dir / task_name
      video_dir = task_log_dir / "videos"
      if self.main_rank:
          os.makedirs(video_dir, exist_ok=True)

      task_logger = setup_logger(
          main_rank=self.main_rank,
          filename=task_log_dir / f"{task_name}.log" if not self.debug else None,  # log to console when debug is True
          debug=self.debug,
          name=f'{task_name}_logger'
      )

      task_logger.info(f"Task suite: {task_name}")
      self.main_logger.info(f"Task suite: {task_name}")

      # Set up environment
      ms3_task_name = self.ms3_translator.get(task_name, task_name)

      env: BaseEnv = gym.make(
          ms3_task_name,
          obs_mode="rgb+segmentation",
          num_envs=self.n_parallel_eval,
          sensor_configs={"shader_pack": "default"},
      )

      cnt_episode = 0
      eval_metrics = collections.defaultdict(list)

      # Set up receding horizon control
      action_plan = collections.deque()

      while cnt_episode < self.n_eval_episode:
          task_seed = task_seed + cnt_episode
          obs, _ = env.reset(seed=task_seed, options={"episode_id": torch.tensor([task_seed + i for i in range(self.n_parallel_eval)])})
          instruction = env.unwrapped.get_language_instruction()

          images = []
          predicted_terminated, truncated = False, False
          images.append(get_image_from_maniskill3_obs_dict(env, obs).cpu().numpy())
          elapsed_steps = 0
          while not (predicted_terminated or truncated):
              if not action_plan:
                  # action horizon is all executed
                  # Query model to get action
                  element = {
                          "observation.images.top": images[-1],
                          "observation.state": obs['agent']['eef_pos'].cpu().numpy(),
                          "task": instruction
                          }
                  action_chunk = self.client.infer(element)

                  # action chunk is of the size [batch, action_step, action_dim]
                  # but the deque can only take something like [action_step, batch, action_dim]
                  action_plan.extend(action_chunk[:, :self.action_step, :].transpose(1, 0, 2))

              action = action_plan.popleft()
              obs, reward, terminated, truncated, info = env.step(action)
              elapsed_steps += 1
              info = common.to_numpy(info)

              truncated = bool(truncated.any()) # note that all envs truncate and terminate at the same time.
              images.append(get_image_from_maniskill3_obs_dict(env, obs).cpu().numpy())

          for k, v in info.items():
              eval_metrics[k].append(v.flatten())

          if self.pipeline_cfg.eval_cfg.recording:
              for i in range(len(images[-1])):
                  # save video. The naming is ugly but it's to follow previous naming scheme
                  success_string = "_success" if info['success'][i].item() else ""
                  images_to_video([img[i] for img in images], video_dir, f"video_{cnt_episode + i}{success_string}", fps=10, verbose=True)

          cnt_episode += self.n_parallel_eval


      mean_metrics = {k: np.mean(v) for k, v in eval_metrics.items()}
      success_rate = mean_metrics['success']
      task_eval_time = time.time() - start_task_time

      # log results
      self._log_summary(logger=task_logger,
                        cnt_episode=cnt_episode,
                        eval_time=task_eval_time,
                        success_rate=success_rate)

      self._log_summary(logger=self.main_logger,
                        cnt_episode=cnt_episode,
                        eval_time=task_eval_time,
                        success_rate=success_rate)

      if self.use_wandb:
          self.wandb_metrics[task_name] = success_rate

      env.close()
      del env
      gc.collect()
      torch.cuda.empty_cache()

@StoneT2000
Collaborator

I see. Could you try pip uninstall maniskill and then install mani-skill-nightly?

@IrvingF7
Author

IrvingF7 commented May 2, 2025

I see. Could you try pip uninstall maniskill and then install mani-skill-nightly?

Thanks. I will report back when I get home and have time to test. For the time being, I am using multiprocessing to speed up the ms2-based Simpler.
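
Roughly, the stopgap looks like this (a sketch; run_task is a hypothetical stand-in for the evaluator above): each task runs in its own worker process, so whatever the simulator holds on the GPU is reclaimed when that process exits.

  import multiprocessing as mp

  def run_task(task_name: str) -> None:
      # hypothetical stand-in: make the env, run the evaluation episodes,
      # and close it; anything the worker allocated (including VRAM held
      # by the simulator) is released when the process exits
      ...

  if __name__ == "__main__":
      mp.set_start_method("spawn")  # CUDA-safe start method
      for task_name in ["TaskA-v1", "TaskB-v1"]:  # placeholder task ids
          p = mp.Process(target=run_task, args=(task_name,))
          p.start()
          p.join()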
