Did you already see this issue (Warning: NaN or Inf found in input tensor #282)?

Author ysujiang commented Dec 29, 2020

Are you using the latest dev branch? Could you share your training config? Did you already see this issue (Warning: NaN or Inf found in input tensor #282)?

Yes, I ...
Rank 0: found NaN in local grad norm in backward pass before data-parallel communication collective. Device: 0

Configuration:
using world size: 8, data-parallel size: 8, context-parallel size: 1 tensor-model-parallel size: 1, pipeline-model-parallel size: 1 ...
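
For anyone hitting the same message, it may help to see what a "NaN in local grad norm" check amounts to. The sketch below is only an illustration in plain Python, not the code the training framework runs: the function name `find_nonfinite_grads`, the parameter names, and the gradient values are all made up. It computes the L2 norm of each gradient on one rank and reports the ones that come out as NaN or Inf, which is a cheap way to narrow down which parameter went bad first.

```python
import math


def find_nonfinite_grads(named_grads):
    """Return (name, norm) for every gradient whose L2 norm is NaN or Inf.

    `named_grads` maps a parameter name to a flat list of gradient values.
    This is a hypothetical helper for illustration only.
    """
    bad = []
    for name, grad in named_grads.items():
        # A single NaN or Inf element is enough to make the whole norm non-finite.
        norm = math.sqrt(sum(g * g for g in grad))
        if not math.isfinite(norm):
            bad.append((name, norm))
    return bad


# Made-up gradients standing in for one rank's local gradients.
grads = {
    "embedding.weight": [0.1, -0.2, 0.05],
    "layer0.attn.weight": [0.3, float("nan"), 0.1],
    "layer1.mlp.weight": [float("inf"), 0.0, 0.2],
}

for name, norm in find_nonfinite_grads(grads):
    print(f"non-finite grad norm in {name}: {norm}")
```

With PyTorch tensors the same idea would typically be expressed with `torch.isfinite` applied to each parameter's `.grad` after the backward pass; running the check per parameter rather than on the global norm is what tells you where the NaN first appears.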