Model Parallel - Sharding Model Parameters #3272
Replies: 1 comment
These GPU issues turned out to be unrelated to the model parameter sharding script (despite appearing to occur after script execution). Closing as a result.
-
Hi Flax Community,
I am a novice at sharding model parameters across devices, and I attempted to do so based on the following guide. Unfortunately, I am encountering, and currently debugging, GPU errors that result from running my training script with model parameter sharding.
I attempted a clean install of the NVIDIA software and a reboot, but this results in the following output from nvidia-smi:
Does anyone have a resource on best practices for sharding sets of model parameters? I want to avoid encountering this issue again in the future, so any advice or resources for model parameter sharding would be much appreciated.
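For anyone landing here with the same question, a minimal sketch of parameter sharding with the jax.sharding API might look like the following. It is not the script discussed in this thread: the parameter names, shapes, the "model" axis name, and the spec_for helper are all hypothetical, and it assumes a recent JAX version that provides Mesh, NamedSharding, and PartitionSpec.

```python
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Build a 1-D device mesh over whatever devices JAX can see.
# This also works with a single CPU device (nothing is actually split then).
devices = np.array(jax.devices())
n = len(devices)
mesh = Mesh(devices, axis_names=("model",))

# Hypothetical parameter pytree; names and shapes are made up for illustration.
# The last kernel axis is a multiple of the device count so it divides evenly.
params = {
    "dense": {
        "kernel": jnp.ones((8, 4 * n)),
        "bias": jnp.zeros((4 * n,)),
    }
}

def spec_for(x):
    # Assumption for this sketch: split the last axis of 2-D arrays across
    # the "model" mesh axis and replicate everything else on every device.
    return P(None, "model") if x.ndim == 2 else P()

# Place each leaf on the mesh according to its partition spec.
sharded_params = jax.tree_util.tree_map(
    lambda x: jax.device_put(x, NamedSharding(mesh, spec_for(x))),
    params,
)

# Inspect how each leaf ended up laid out.
for path, leaf in jax.tree_util.tree_leaves_with_path(sharded_params):
    print(jax.tree_util.keystr(path), leaf.shape, leaf.sharding)
```

Printing the sharding of every leaf before starting training is a cheap way to confirm the layout is what you intended, independently of any driver or GPU problems.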
Update:
I will include my learnings here once I have come to a resolution on this. At the moment this is lower priority for my project, so it may be a while before I update.