First, thank you for your amazing work on this project and for being open to community feedback.
I would like to propose a small enhancement that could significantly improve compatibility with low-level profiling tools such as NVIDIA CUPTI or Intel VTune.
Motivation)
Tools like NVIDIA CUPTI rely on environment variables set by Torch, such as RANK, WORLD_SIZE, or LOCAL_RANK, to understand the role of each process in a distributed training job. However, there is currently no easy way to obtain the data parallel (DP) rank from accelerate.
Proposal)
After initializing the process group (e.g., via torch.distributed.init_process_group or Accelerator()),
each process can set the following environment variables (locally) just once:
import os
import torch.distributed as dist

if dist.is_initialized():
    # Expose the data parallel rank and world size to external tools.
    os.environ["DP_RANK"] = str(dist.get_rank())
    os.environ["DP_WORLD_SIZE"] = str(dist.get_world_size())
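For reference, a minimal sketch of what the equivalent could look like on the Accelerate side, assuming Accelerator's existing process_index and num_processes attributes are the right source of truth (this is just an illustration, not a proposed final implementation):

import os
from accelerate import Accelerator

accelerator = Accelerator()

# Sketch only: mirror the process index / world size reported by Accelerate
# into environment variables that external profilers can read.
os.environ["DP_RANK"] = str(accelerator.process_index)
os.environ["DP_WORLD_SIZE"] = str(accelerator.num_processes)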
This would allow external profilers to correlate collected GPU kernel events with the data parallel rank of each process, making it easier for users to understand training bottlenecks.
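As a rough example of how a profiling setup might consume these variables, one could tag NVTX ranges with the DP rank so that CUPTI/Nsight timelines can be grouped per rank (DP_RANK and DP_WORLD_SIZE are the proposed variables, not ones that exist today):

import os
import torch

# Hypothetical consumer: read the proposed variables (with safe fallbacks)
# and annotate an NVTX range so kernel traces can be attributed to a DP rank.
dp_rank = os.environ.get("DP_RANK", "0")
dp_world_size = os.environ.get("DP_WORLD_SIZE", "1")

torch.cuda.nvtx.range_push(f"train_step dp_rank={dp_rank}/{dp_world_size}")
# ... forward / backward / optimizer step ...
torch.cuda.nvtx.range_pop()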
Thank you once again for your time and for considering this suggestion.
I'd be happy to submit a PR if this is aligned with your design direction.