AssertionError: Check batch related parameters. train_batch_size is not equal to micro_batch_per_gpu * gradient_acc_step * world_size 9 != 1 * 3 * 1

To Reproduce
Steps to reproduce the behavior:
Run the following script on a Ray cluster with 3 nodes, each hosting 1 NVIDIA A100 GPU ...
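This assertion is DeepSpeed's batch-size consistency check: train_batch_size must equal train_micro_batch_size_per_gpu * gradient_accumulation_steps * world_size. With 3 nodes of 1 GPU each the world size should be 3, so the trailing "* 1" in the error suggests only a single rank was visible when the engine initialized. A minimal sketch of a config that satisfies the identity for this setup, assuming the standard DeepSpeed JSON keys and hypothetical values:

```python
# Minimal sketch, assuming the standard DeepSpeed config keys and that
# all 3 ranks (3 nodes x 1 GPU each) actually join the process group.
ds_config = {
    "train_batch_size": 9,                 # must equal 1 * 3 * 3
    "train_micro_batch_size_per_gpu": 1,   # micro batch per GPU
    "gradient_accumulation_steps": 3,      # accumulation steps
}
# The reported failure, 9 != 1 * 3 * 1, indicates world_size resolved
# to 1, i.e. only one process was in the group at initialization time.
```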
train_batch_size is not equal to micro_batch_per_gpu * gradient_acc_step * world_size 256 != 4 * 8 * 1
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 91809) of binary: /home/ubuntu/anaconda3/envs/chat/bin/python

when I run ...
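In this second report the arithmetic itself does not hold: with world_size 1, 4 * 8 gives 32, not 256, so either train_batch_size must drop to 32 or one of the other factors must grow (e.g. gradient accumulation of 64 with micro batch 4). A small sketch of the same check, with hypothetical values, mirrors the relation DeepSpeed enforces (not its actual code):

```python
# Sketch of the batch-size consistency relation; all values are hypothetical.
def check_batch_config(train_batch_size: int,
                       micro_batch_per_gpu: int,
                       gradient_accumulation_steps: int,
                       world_size: int) -> None:
    expected = micro_batch_per_gpu * gradient_accumulation_steps * world_size
    assert train_batch_size == expected, (
        f"{train_batch_size} != {micro_batch_per_gpu} * "
        f"{gradient_accumulation_steps} * {world_size}"
    )

check_batch_config(256, 4, 64, 1)    # passes: 4 * 64 * 1 == 256
# check_batch_config(256, 4, 8, 1)   # fails as reported: 256 != 4 * 8 * 1
```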