In ordinary PyTorch code, saving a model is just torch.save(model, path) or torch.save(model.state_dict(), path). But when running on multiple GPUs (single-node or multi-node alike), you cannot save this way directly: multi-GPU training wraps the model in an extra layer, the Module wrapper (DataParallel / DistributedDataParallel), so every key in the saved state_dict gains a module. prefix and the checkpoint then fails to load in a single-GPU environment. (See the figure in 解决pytorch多GPU训练保存的模型,在单GPU环境下加载出错问题 - 腾讯云开发者..., from which this passage is quoted.)
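A minimal sketch of both the symptom and the usual fix (unwrap the model through .module before saving); the Linear layer and file name here are made up for illustration:

```python
import torch
import torch.nn as nn

model = nn.DataParallel(nn.Linear(10, 2))   # the multi-GPU wrapper
print(next(iter(model.state_dict())))       # "module.weight" -- note the prefix

# Fix: save the state_dict of the *unwrapped* model (model.module),
# so the checkpoint loads cleanly in a single-GPU / CPU environment.
torch.save(model.module.state_dict(), "model.pt")

plain = nn.Linear(10, 2)
plain.load_state_dict(torch.load("model.pt"))  # no key renaming needed
```

If you already have a checkpoint saved with the module. prefix, the alternative fix is to strip that prefix from the keys before calling load_state_dict.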
accelerator.save_state hangs when using DeepSpeed with multiple GPUs on a single node. The question is whether save_state should run only under the main process (is_main_process)? I have seen the model being saved only on the main process when using distributed training mode in PyTorch...
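A plausible explanation, sketched below under the assumption that checkpointing is a collective operation under DeepSpeed (each rank must gather and write its own ZeRO shard): if only rank 0 enters accelerator.save_state, the other ranks never join the collective and rank 0 blocks forever. Unlike a plain torch.save on rank 0, save_state should therefore be called on all processes. The model, optimizer, and paths below are illustrative only:

```python
import json
import torch
from accelerate import Accelerator

accelerator = Accelerator()  # assume launch via `accelerate launch` with a DeepSpeed config
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model, optimizer = accelerator.prepare(model, optimizer)

# ... training steps ...

# Hangs with DeepSpeed: only rank 0 enters what is a collective call.
# if accelerator.is_main_process:
#     accelerator.save_state("ckpt")

# Works: every process calls save_state; Accelerate/DeepSpeed decide
# internally which rank writes which shard.
accelerator.save_state("ckpt")

# Purely local, non-collective artifacts can still be restricted to rank 0:
if accelerator.is_main_process:
    with open("ckpt/metrics.json", "w") as f:
        json.dump({"step": 1000}, f)  # illustrative metadata only
```

This mirrors the rank-0-only saving pattern familiar from plain PyTorch DDP, which is safe there because torch.save(model.module.state_dict(), ...) involves no cross-rank communication; with DeepSpeed-sharded state, that assumption no longer holds.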