
[REQUEST] Exposing parallelism rank as an environment variable for integration with profiling tools like CUPTI #3678

@WoosungMyung

Description


First, thank you for your amazing work on this project and for being open to community feedback.

I would like to propose a small enhancement that could significantly improve compatibility with low-level profiling tools such as NVIDIA CUPTI or Intel VTune.

Motivation

Tools like NVIDIA CUPTI rely on environment variables set by torch, such as RANK, WORLD_SIZE, or LOCAL_RANK, to understand the role of each process in a distributed training job. However, there is currently no easy way to determine the data parallel (DP) rank from accelerate.

Proposal

After initializing the process group (e.g., via torch.distributed.init_process_group or Accelerator()),
each process could set the following environment variables locally, just once:

import os

import torch.distributed as dist

if dist.is_initialized():
    # Expose the data parallel rank and world size of this process so that
    # external profilers can pick them up from the environment.
    os.environ["DP_RANK"] = str(dist.get_rank())
    os.environ["DP_WORLD_SIZE"] = str(dist.get_world_size())

This would allow external profilers to correlate collected GPU kernel events with the data parallel rank of each process, making it easier for users to understand training bottlenecks.
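
On the accelerate side, a minimal sketch of where this could happen, assuming the public Accelerator attributes process_index and num_processes, might look like:

import os

from accelerate import Accelerator

accelerator = Accelerator()

# Mirror accelerate's view of the data parallel topology into environment
# variables that external profilers (CUPTI, VTune, etc.) can read.
# The names follow the DP_RANK / DP_WORLD_SIZE convention proposed above.
os.environ.setdefault("DP_RANK", str(accelerator.process_index))
os.environ.setdefault("DP_WORLD_SIZE", str(accelerator.num_processes))

Using setdefault would keep any values already exported by a launcher or wrapper script intact.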

Thank you once again for your time and for considering this suggestion.
I'd be happy to submit a PR if this is aligned with your design direction.

Best regards,
Woosung Myung
