CUDA_VISIBLE_DEVICES=1 deepspeed --num_gpus=1 ./finetune_trainer.py ... but it runs on GPU 0, ignoring CUDA_VISIBLE_DEVICES=1. Then I tried to use the deepspeed launcher flags as explained here: https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node and encountered multiple issues...
I expected deepspeed to inherit the specific GPU numeric assignments from CUDA_VISIBLE_DEVICES. Instead, DeepSpeed seems to always re-index the GPU assignments so that they start from 0. I believe this code snippet shows how the values given to the world_info dictionary are created (specific...
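To make the difference concrete, here is a minimal sketch of the two behaviors being compared. It is not DeepSpeed's code: `visible_physical_ids`, `reindexed_slots`, and `inherited_slots` are hypothetical helper names used only for illustration, and the sketch assumes CUDA_VISIBLE_DEVICES holds comma-separated numeric IDs (not GPU UUIDs).

```python
import os


def visible_physical_ids(env=None):
    """Parse CUDA_VISIBLE_DEVICES into a list of physical GPU IDs.

    Returns None when the variable is unset.
    Assumption: entries are numeric IDs, not GPU UUIDs.
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    return [int(tok) for tok in raw.split(",") if tok.strip()]


def reindexed_slots(num_gpus):
    """The behavior described above: slots always start from 0."""
    return list(range(num_gpus))


def inherited_slots(num_gpus, env=None):
    """The behavior I expected: take the IDs from CUDA_VISIBLE_DEVICES."""
    ids = visible_physical_ids(env)
    if ids is None:
        # Nothing to inherit, so fall back to 0..num_gpus-1.
        return list(range(num_gpus))
    return ids[:num_gpus]


if __name__ == "__main__":
    env = {"CUDA_VISIBLE_DEVICES": "1"}
    print("re-indexed:", reindexed_slots(1))
    print("inherited: ", inherited_slots(1, env))
```

With CUDA_VISIBLE_DEVICES=1 and one GPU requested, the re-indexed variant selects slot 0 while the inherited variant selects physical GPU 1, which is the mismatch reported here.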
Python platform: Linux-5.14.0-427.42.1.el9_4.x86_64-x86_64-with-glibc2.34
Is CUDA available: True
CUDA runtime version: 12.2.128
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA H100 80GB HBM3
GPU 1: NVIDIA H100 80GB HBM3
...