Sometimes a simple reboot clears out unwanted leftover processes. You can also kill the training process from htop, since some processes may not terminate on their own, or use a different master port.

The training stopped at scanning val images.

Just wait a few minutes. It takes time to communicate between two servers...
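To make the "use a different master port" suggestion concrete, here is a minimal sketch in plain Python (standard library only). It assumes the launcher reads the rendezvous port from the `MASTER_PORT` environment variable, as PyTorch's `env://` initialization does, and that 29500 is the port a stale process may still be holding. The helper names `find_free_port` and `port_in_use` are hypothetical, not part of any library.

```python
import os
import socket


def find_free_port() -> int:
    """Ask the OS for a TCP port that is unused right now (hypothetical helper)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))  # port 0 lets the OS pick any free port
        return s.getsockname()[1]


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already listening on host:port (hypothetical helper)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1.0)
        return s.connect_ex((host, port)) == 0  # 0 means the connection succeeded


if __name__ == "__main__":
    default_port = 29500  # assumption: the port a leftover process may still hold
    if port_in_use(default_port):
        new_port = find_free_port()
        # Assumption: the training launcher picks up MASTER_PORT from the environment.
        os.environ["MASTER_PORT"] = str(new_port)
        print(f"Port {default_port} is busy; using MASTER_PORT={new_port}")
    else:
        print(f"Port {default_port} looks free")
```

The port returned by `find_free_port` is only free at the moment of the check, so another process could still grab it before training starts. For multi-server runs, every node has to be given the same port value.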
[2024-06-27 09:57:59,229 C 1562488 1562488] plasma_store_provider.cc:73: Check failed: _s.ok() Bad status: IOError: Connection reset by peer

The above error pops up. How should I resolve this, and what should I do next?

Member glenn-jocher commented Jun 27, 2024

Hi @kanis777...