In ordinary PyTorch code, saving a model is just torch.save(model, path) or torch.save(model.state_dict(), path). But when running on multiple GPUs (single-node or multi-node alike), you cannot save this way directly: multi-GPU training wraps the model in an extra layer, the Module wrapper (DataParallel / DistributedDataParallel), so every key in the saved state_dict gains a module. prefix and the checkpoint then fails to load in a single-GPU environment. (See the figure in 解决pytorch多GPU训练保存的模型,在单GPU环境下加载出错问题 - 腾讯云开发者..., from which this passage is quoted.)
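A minimal sketch of both the symptom and the usual fix (unwrap the model through .module before saving); the Linear layer and file name here are made up for illustration:

```python
import torch
import torch.nn as nn

model = nn.DataParallel(nn.Linear(10, 2))   # the multi-GPU wrapper
print(next(iter(model.state_dict())))       # "module.weight" -- note the prefix

# Fix: save the state_dict of the *unwrapped* model (model.module),
# so the checkpoint loads cleanly in a single-GPU / CPU environment.
torch.save(model.module.state_dict(), "model.pt")

plain = nn.Linear(10, 2)
plain.load_state_dict(torch.load("model.pt"))  # no key renaming needed
```

If you already have a checkpoint saved with the module. prefix, the alternative fix is to strip that prefix from the keys before calling load_state_dict.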
accelerator.save_state hangs when using DeepSpeed with multiple GPUs on a single node. The question is whether save_state should run only under the main process (is_main_process)? I have seen the model being saved only on the main process when using distributed training mode in PyTorch...
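A plausible explanation, sketched below under the assumption that checkpointing is a collective operation under DeepSpeed (each rank must gather and write its own ZeRO shard): if only rank 0 enters accelerator.save_state, the other ranks never join the collective and rank 0 blocks forever. Unlike a plain torch.save on rank 0, save_state should therefore be called on all processes. The model, optimizer, and paths below are illustrative only:

```python
import json
import torch
from accelerate import Accelerator

accelerator = Accelerator()  # assume launch via `accelerate launch` with a DeepSpeed config
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model, optimizer = accelerator.prepare(model, optimizer)

# ... training steps ...

# Hangs with DeepSpeed: only rank 0 enters what is a collective call.
# if accelerator.is_main_process:
#     accelerator.save_state("ckpt")

# Works: every process calls save_state; Accelerate/DeepSpeed decide
# internally which rank writes which shard.
accelerator.save_state("ckpt")

# Purely local, non-collective artifacts can still be restricted to rank 0:
if accelerator.is_main_process:
    with open("ckpt/metrics.json", "w") as f:
        json.dump({"step": 1000}, f)  # illustrative metadata only
```

This mirrors the rank-0-only saving pattern familiar from plain PyTorch DDP, which is safe there because torch.save(model.module.state_dict(), ...) involves no cross-rank communication; with DeepSpeed-sharded state, that assumption no longer holds.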