CUDA driver is a stub library
misc/strongstream.cc:60 NCCL WARN Cuda failure 'CUDA driver is a stub library'
misc/argcheck.cc:39 NCCL WARN AllReduce : invalid root 0 (root should be in the 0..-1 range)
nccl-test runs, but Horovod fails to run. Possible cause: Q: Horovod is statically linked with NCCL by default A: ...
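2. Guide to setting the NCCL_DEBUG=INFO environment variable to get more debugging information
To get more detailed debugging information, set the environment variable NCCL_DEBUG to INFO. NCCL's documented levels are VERSION, WARN, INFO, and TRACE; WARN is less verbose than INFO and prints only an explicit message when an NCCL call fails, and INFO is usually enough for debugging. You can set the variable on the command line, or from within your code through an appropriate API call, if one is available. Setting it on the command line (Linux/...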
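As a minimal sketch of the "set it from your code" option described above: the helper below builds an environment with NCCL_DEBUG set and passes it to a child process. The function name `nccl_debug_env` is hypothetical, and the child command is only a stand-in for a real training script. It assumes the variable has to be present in the process environment before NCCL is initialized.

```python
import os
import subprocess
import sys

# Levels listed in the NCCL environment-variable documentation.
NCCL_DEBUG_LEVELS = {"VERSION", "WARN", "INFO", "TRACE"}


def nccl_debug_env(level="INFO", base_env=None):
    """Return a copy of the environment with NCCL_DEBUG set (hypothetical helper)."""
    level = level.upper()
    if level not in NCCL_DEBUG_LEVELS:
        raise ValueError(f"unsupported NCCL_DEBUG level: {level!r}")
    # Copy so the caller's environment is not modified in place.
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_DEBUG"] = level
    return env


if __name__ == "__main__":
    # Stand-in for a training script: the child only echoes the level it sees.
    child = [sys.executable, "-c", "import os; print(os.environ['NCCL_DEBUG'])"]
    subprocess.run(child, env=nccl_debug_env("INFO"), check=True)
```

Within a single process, the same effect can be had by assigning `os.environ["NCCL_DEBUG"] = "INFO"` at the top of the script, before the first collective-communication library is imported or initialized.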
(run with NCCL_DEBUG=WARN for details)
[rank1]: Traceback (most recent call last):
[rank1]:   File "/localnvme/application/sc_new/myy_world_consistency/LLaMA-Factory-0.9.0-9.27main/src/llamafactory/launcher.py", line 23, in <module>
[rank1]:     launch()
[rank1]:   File "/localnvme/...
(train_rft pid=2657296) 2024-05-11 14:10:52.227 n176-080-198:2657296:2658232 [0] include/alloc.h:102 NCCL WARN Cuda failure 1 'invalid argument'
(train_rft pid=2657296) 2024-05-11 14:10:52.227 n176-080-198:2657296:2658232 [0] NCCL INFO transport/p...