Query Regarding ZeRO-1 in ColossalAI Not Sharding Optimizer State #4328
yhna940 asked this question in Community | Q&A
I have recently been studying the ZeRO-1 strategy implemented by ColossalAI and have noticed something that seems unusual. As far as I understand, ColossalAI uses the LowLevelZeroOptimizer for its ZeRO-1 strategy.
According to the relevant literature, ZeRO-1 should shard the optimizer state, akin to what is done in fairscale's OSS or torch's ZeroRedundancyOptimizer. However, while reading through the inner workings of LowLevelZeroOptimizer, I could not find any place where the optimizer's state is sharded. I was able to confirm that it shards the gradients and parameters, but not the optimizer state.
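For context, by "sharding the optimizer state" I mean something along the lines of the rough sketch below, using torch's ZeroRedundancyOptimizer as a reference point (the model, dimensions, and hyperparameters are just placeholders): each rank builds the Adam moments only for the parameter shard it owns, so optimizer memory is partitioned across ranks instead of being replicated on every one.

```python
# Rough sketch of optimizer-state sharding via torch's ZeroRedundancyOptimizer.
# The model and hyperparameters are placeholders; run under torchrun with one
# process per GPU so that torch.distributed is initialized.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.optim import ZeroRedundancyOptimizer

dist.init_process_group(backend="nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

model = DDP(torch.nn.Linear(1024, 1024).cuda(), device_ids=[rank])

# Each rank keeps Adam state (exp_avg, exp_avg_sq) only for the parameter
# shard it owns, rather than holding the full optimizer state everywhere.
optimizer = ZeroRedundancyOptimizer(
    model.parameters(),
    optimizer_class=torch.optim.Adam,
    lr=1e-3,
)

loss = model(torch.randn(8, 1024, device="cuda")).sum()
loss.backward()
optimizer.step()  # updates the local shard, then syncs parameters across ranks
```

My expectation was that LowLevelZeroOptimizer in stage 1 would partition its inner optimizer's state in a comparable way.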
I am seeking verification of my understanding here. Is it indeed the case that ColossalAI's ZeRO-1 does not shard the optimizer state, or am I missing something? I would appreciate any insights or clarifications you can provide.
Thank you!