First, thank you for your amazing work on this project and for being open to community feedback.
I would like to propose a small enhancement that could significantly improve compatibility with low-level profiling tools such as NVIDIA CUPTI or Intel VTune.
Motivation)
Tools like NVIDIA CUPTI rely on environment variables set by Torch, such as RANK, WORLD_SIZE, or LOCAL_RANK, to understand the role of each process in a distributed training job. However, there is currently no easy way to obtain the data parallel (DP) rank from accelerate.
Proposal)
After initializing the process group (e.g., via torch.distributed.init_process_group or Accelerator()),
each process can set the following environment variables (locally) just once:
import os
import torch.distributed as dist

if dist.is_initialized():
    # Expose the data parallel rank and world size to external tools.
    os.environ["DP_RANK"] = str(dist.get_rank())
    os.environ["DP_WORLD_SIZE"] = str(dist.get_world_size())
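For reference, a minimal sketch of what the equivalent could look like on the Accelerate side, assuming Accelerator's existing process_index and num_processes attributes are the right source of truth (this is just an illustration, not a proposed final implementation):

import os
from accelerate import Accelerator

accelerator = Accelerator()

# Sketch only: mirror the process index / world size reported by Accelerate
# into environment variables that external profilers can read.
os.environ["DP_RANK"] = str(accelerator.process_index)
os.environ["DP_WORLD_SIZE"] = str(accelerator.num_processes)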
This would allow external profilers to correlate collected GPU kernel events with the data parallel rank of each process, making it easier for users to understand training bottlenecks.
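As a rough example of how a profiling setup might consume these variables, one could tag NVTX ranges with the DP rank so that CUPTI/Nsight timelines can be grouped per rank (DP_RANK and DP_WORLD_SIZE are the proposed variables, not ones that exist today):

import os
import torch

# Hypothetical consumer: read the proposed variables (with safe fallbacks)
# and annotate an NVTX range so kernel traces can be attributed to a DP rank.
dp_rank = os.environ.get("DP_RANK", "0")
dp_world_size = os.environ.get("DP_WORLD_SIZE", "1")

torch.cuda.nvtx.range_push(f"train_step dp_rank={dp_rank}/{dp_world_size}")
# ... forward / backward / optimizer step ...
torch.cuda.nvtx.range_pop()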
Thank you once again for your time and for considering this suggestion.
I'd be happy to submit a PR if this is aligned with your design direction.