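Why use_cache=true is incompatible with gradient checkpointing: use_cache=true is typically used in sequence generation tasks (such as text generation) to cache the key-value pairs from previous time steps and so speed up computation at later time steps. It does this by avoiding recomputation of hidden states that have already been computed, which improves inference speed. However, gradient checkpointing is a technique for reducing...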
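As a rough illustration of the caching idea described above, here is a minimal pure-Python sketch of step-by-step decoding with and without a key-value cache. It is a toy under stated assumptions, not any library's implementation: the helper names `project`, `attend`, and `decode`, the scalar "weights", and the sample tokens are all hypothetical. It only shows that caching yields the same outputs while projecting each token's key and value once instead of once per later step.

```python
import math

def project(x, w):
    """Toy linear projection (hypothetical): scale every component of x by w."""
    return [w * v for v in x]

def attend(q, keys, values):
    """Scaled dot-product attention for one query over all available positions."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]

def decode(tokens, use_cache):
    """Decode one step at a time.

    Returns (outputs, kv_projections), where kv_projections counts how many
    times a token's key/value pair was computed.
    """
    cache_k, cache_v = [], []
    outputs, kv_projections = [], 0
    for t in range(len(tokens)):
        if use_cache:
            # Project only the newest token; earlier keys/values are reused.
            cache_k.append(project(tokens[t], 0.5))
            cache_v.append(project(tokens[t], 2.0))
            kv_projections += 1
            keys, values = cache_k, cache_v
        else:
            # Recompute keys/values for the whole prefix at every step.
            keys = [project(x, 0.5) for x in tokens[: t + 1]]
            values = [project(x, 2.0) for x in tokens[: t + 1]]
            kv_projections += t + 1
        q = project(tokens[t], 1.0)
        outputs.append(attend(q, keys, values))
    return outputs, kv_projections

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
out_cached, n_cached = decode(tokens, use_cache=True)
out_plain, n_plain = decode(tokens, use_cache=False)
print(n_cached, n_plain)
```

With four tokens, the cached run projects keys and values once per token, while the uncached run repeats the work for the whole prefix at each step, so the saving grows with sequence length.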
🐛 Describe the bug I am comparing the memory cost between use_reentrant=False and use_reentrant=True when using gradient checkpointing. When I set use_reentrant=False, I find the peak memory is exactly the same as the one without using g...