Why is logging limited to rank=[0,1] in engine.py -> _save_checkpoint()? #2067
Unanswered
dunalduck0
asked this question in
Q&A
Replies: 0 comments
Link to the source code
Line 2997 produces log output only for ranks 0 and 1. In multi-node training, the main process on each node can have a different global rank. For example, with 2 nodes of 8 GPUs each, the main process of the 2nd node normally has rank 8, so nothing is logged for it. I think this logging would be useful for more ranks than just [0,1].
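To make the example concrete, here is a minimal sketch of the rank arithmetic. It assumes ranks are assigned contiguously per node (global rank = node index * GPUs per node + local rank), which is the usual layout but not guaranteed. `local_main_ranks` and `should_log` are hypothetical helpers written for illustration, not functions from the library.

```python
from __future__ import annotations


def local_main_ranks(num_nodes: int, gpus_per_node: int) -> list[int]:
    """Global ranks of the local_rank == 0 process on each node.

    Assumes contiguous rank assignment per node (an assumption, see above).
    """
    return [node * gpus_per_node for node in range(num_nodes)]


def should_log(rank: int, allowed_ranks: list[int]) -> bool:
    """Hypothetical stand-in for a rank filter such as ranks=[0, 1]."""
    return rank in allowed_ranks


# The setup from the question: 2 nodes, 8 GPUs each.
mains = local_main_ranks(num_nodes=2, gpus_per_node=8)

# Local main processes that a fixed [0, 1] filter would silence.
silent = [r for r in mains if not should_log(r, [0, 1])]

print(mains)   # [0, 8]
print(silent)  # [8]
```

Under these assumptions, rank 8 is the only local main process on the second node, and a fixed [0, 1] filter never lets it log.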