Describe the bug: I tried to train a ControlNet with both DeepSpeed Stage-3 and gradient checkpointing, but unexpected errors occur. There is no problem using either of these alone; the errors seem to happen in the loss backward pass:...
Describe the bug: During Step 2 - Reward Model of DeepSpeed-Chat, an AssertionError occurs in the backward pass for ZeRO stage 3 if gradient_checkpointing is enabled, while it works if gradient_checkpointing is disabled. Log output: Traceback (most recent call last): File "run_bloom.py", li...
Fixes a bug where, if gradient checkpointing is enabled, SeamlessM4Tv2ConformerEncoderLayer.forward() is called with some arguments missing. Fixes #31028. Before submitting: This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). Did you read the...
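To illustrate the general class of bug described in the last excerpt, here is a minimal pure-Python sketch of how a checkpointed call path can drift from the regular one and drop an argument. Everything in it is a hypothetical stand-in: layer_forward, toy_checkpoint, extra_mask, and the two run_layer_* helpers are invented for illustration and are not the SeamlessM4Tv2 code or the actual change in the PR. The toy helper also performs no real activation recomputation.

```python
# Hypothetical stand-ins only; none of these names come from transformers.

def layer_forward(hidden_states, attention_mask=None, extra_mask=None):
    """Toy layer: reports which optional arguments actually arrived."""
    return {
        "hidden_states": hidden_states,
        "got_attention_mask": attention_mask is not None,
        "got_extra_mask": extra_mask is not None,
    }


def toy_checkpoint(fn, *args):
    """Toy checkpoint helper: forwards positional arguments, does no recomputation."""
    return fn(*args)


def run_layer_buggy(hidden_states, attention_mask, extra_mask, gradient_checkpointing):
    if gradient_checkpointing:
        # Bug pattern: extra_mask is left out of the checkpointed call.
        return toy_checkpoint(layer_forward, hidden_states, attention_mask)
    return layer_forward(
        hidden_states, attention_mask=attention_mask, extra_mask=extra_mask
    )


def run_layer_fixed(hidden_states, attention_mask, extra_mask, gradient_checkpointing):
    if gradient_checkpointing:
        # Both branches now pass the same arguments, in the same order.
        return toy_checkpoint(layer_forward, hidden_states, attention_mask, extra_mask)
    return layer_forward(
        hidden_states, attention_mask=attention_mask, extra_mask=extra_mask
    )
```

In this sketch the buggy path only misbehaves when gradient_checkpointing is True, which matches the pattern in the excerpts above where the failure appears only once checkpointing is turned on. Comparing the outputs of both branches on the same inputs is one cheap way to catch this kind of drift.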