ImportError: cannot import name 'default_pg_timeout' from 'torch.distributed' (/Users/{USER_NAME}/miniforge3/envs/{ENV}/lib/python3.11/site-packages/torch/distributed/__init__.py)
Indeed, when I trace back to torch.distributed, the following also throws an error: >>> from torch.distributed...
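A guarded import is one way to keep a script running when this name cannot be imported. The sketch below is illustrative only: it assumes the constant is exposed as `torch.distributed.constants.default_pg_timeout` in the installed build, which may not hold for every release or platform. `get_pg_timeout` and `FALLBACK_TIMEOUT` are hypothetical names, and the 30-minute fallback is an assumption chosen for the example.

```python
from datetime import timedelta

# Assumed fallback value, chosen for illustration only.
FALLBACK_TIMEOUT = timedelta(minutes=30)


def get_pg_timeout() -> timedelta:
    """Return the process-group timeout, or a fallback if the import fails."""
    try:
        # Assumption: this module path exists in the installed torch build.
        from torch.distributed.constants import default_pg_timeout
    except ImportError:
        # Covers both a missing torch install and a build without this name.
        return FALLBACK_TIMEOUT
    return default_pg_timeout


print(get_pg_timeout())
```

This only works around the missing name. It does not make distributed features available on a build that lacks them.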
RuntimeError: [1] is setting up NCCL communicator and retreiving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Timeout waiting for key: default_pg/0/0 after 1800000 ms Exception raised from get at ../torch/csrc/distributed/c10d/FileSt...
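The timeout in this message is reported in milliseconds, and converting it makes the duration easier to read. A minimal sketch using only the standard library (`ms_to_timedelta` is a hypothetical helper name):

```python
from datetime import timedelta


def ms_to_timedelta(ms: int) -> timedelta:
    """Convert a millisecond count, as printed in the log, to a timedelta."""
    return timedelta(milliseconds=ms)


timeout = ms_to_timedelta(1_800_000)
print(timeout)  # 0:30:00
```

So rank 1 waited 30 minutes for the key before giving up.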
---
1 | pg01 | primary | * running |      | default | host=pg01 user=repmgr dbname=repmgr port=19200 connect_timeout=2
2 | pg02 | standby |   running | pg01 | default | host=pg02 user=repmgr dbname=repmgr port=19200 connect_timeout=2
A. pg01 is currently the primary node
B. The database port is 19200
C. Primary...
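The host and port can be read from the connection string in the last column. A minimal sketch, assuming a simple space-separated `key=value` string with no quoted values (`parse_conninfo` is a hypothetical helper, not part of repmgr or libpq):

```python
def parse_conninfo(conninfo: str) -> dict:
    """Split a simple 'key=value key=value' string into a dict.

    Assumes no quoting or embedded spaces in values.
    """
    return dict(pair.split("=", 1) for pair in conninfo.split())


info = parse_conninfo(
    "host=pg01 user=repmgr dbname=repmgr port=19200 connect_timeout=2"
)
print(info["host"], int(info["port"]))  # pg01 19200
```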
🐛 Describe the bug I am running the librispeech recipe in distributed mode using Slurm on espnet2. I am running on two Oracle instances, each with a single GPU (Tesla V100). But when I ran stage 11, it created jobs on both machines and gpu me...