Lightning-AI / pytorch-lightning: code stuck at All DDP processes registered #9641 (Closed). derrick-xwp opened this issue Sep 22, 2021 · 15 comments ...
DDP training gets stuck at the first iteration and keeps waiting on a pid: os.waitpid() always returns pid == 0. What version are you seeing the problem on? v1.x. How to reproduce the bug: No response. Error messages and logs: none provided. Environment: torch2.1.0+cud...
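For context on the symptom in the report above: on Unix, `os.waitpid(pid, os.WNOHANG)` returns `(0, 0)` whenever the child has not exited yet, so a parent that polls in a loop sees pid == 0 for as long as the worker is alive. The report does not say which flags were in use, so this is only a minimal sketch of that behaviour; the helper name `poll_running_child` is hypothetical and the child here is a plain sleeping Python process, not a DDP worker.

```python
import os
import subprocess
import sys


def poll_running_child(sleep_seconds: float = 1.0) -> tuple[int, int]:
    """Poll a still-running child without blocking (Unix only, hypothetical helper)."""
    child = subprocess.Popen(
        [sys.executable, "-c", f"import time; time.sleep({sleep_seconds})"]
    )
    try:
        # WNOHANG makes waitpid return immediately instead of blocking;
        # a result of (0, 0) means "the child exists but has not exited yet".
        return os.waitpid(child.pid, os.WNOHANG)
    finally:
        # Reap the child so it does not linger as a zombie.
        child.wait()
```

Seen this way, a pid of 0 says only that the child process is still running; it does not by itself show where that process is blocked.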
Fixed a bug where auto-upgrading to the latest lightning via the CLI could get stuck in a loop (#15984). PyTorch: Fixed the XLAProfiler not recording anything due to mismatched action names (#15885). Full Changelog: 1.8.4...1.8.4.post0
On my server node, training a LightningModule using DDP leads to a freeze, even before entering the training loop. The node has 2 GPUs and the freeze occurs independently of whether accelerator is set to "gpu" or "cpu". Notably, on my local machine, running trainer = pl.Trainer(devices=2...
Bug description: The training code simply gets stuck on the TPU. What version are you seeing the problem on? master. How to reproduce the bug: Just used the following calls to trainer and fit. pl.seed_everything(7, workers=True) torch.set_f...
Trainig stuck before first epoch with ddp and multi-gpu #11910 (Closed). Eralien commented Mar 3, 2022: Same here. Non DP/DDP training has no problem whatsoever. Eralien mentioned this issue Mar 3, 2022: Distributed training hangs at model checkpoint #10947 (Closed). YuFan-Microsoft ...
metrics csv in ddp mode (labels: bug, needs triage, ver: 2.2.x) #20371 opened Oct 29, 2024 by ruyanyinian. FutureWarning: torch.cuda.amp.custom_bwd(args...) is deprecated. Please use torch.amp.custom_bwd(args..., device_type='cuda') instead. (labels: bug...
🐛 Bug It is not possible to create and use the Trainer class more than once with the DDP backend since the program crashes the second time with RuntimeError: Address already in use. To Reproduce Here is a minimal code example which repro...
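An "Address already in use" error of this kind usually means the rendezvous port from an earlier run is still bound. One workaround sometimes tried is to ask the OS for an unused port before each run and publish it through the `MASTER_PORT` environment variable, which `torch.distributed` reads for its rendezvous. The sketch below uses only the standard library; the helper names `find_free_port` and `set_master_port` are hypothetical, and whether this resolves the crash described above is an assumption, not something the report confirms.

```python
import os
import socket


def find_free_port() -> int:
    """Ask the OS for a currently unused TCP port (hypothetical helper)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # Binding to port 0 lets the OS choose any free port.
        sock.bind(("127.0.0.1", 0))
        return sock.getsockname()[1]


def set_master_port() -> int:
    """Export a fresh port as MASTER_PORT and return it (hypothetical helper)."""
    port = find_free_port()
    os.environ["MASTER_PORT"] = str(port)
    return port
```

Calling `set_master_port()` before constructing each new Trainer would be the intended usage. Note that the port is released as soon as the probe socket closes, so another process could in principle claim it before training starts.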
PyTorch Lightning Bolts: Implementation by the Lightning team. SwAV-TF: A TensorFlow re-implementation. Citation If you find this repository useful in your research, please cite: @article{caron2020unsupervised, title={Unsupervised Learning of Visual Features by Contrasting Cluster Assignments}, author=...
🐛 Bug My training / validation step hangs when using ddp on a 4-GPU AWS instance. Usually it happens at the end of the first epoch, but sometimes in the middle of it. Code runs fine on 1 GPU. My model checkpoint is a very basic setup ...