I want to implement asynchronous communication with ncclSend/ncclRecv in different streams, for a single process with multiple devices. My code is like the following. On program initialization: GPU0 (thread 0): 1. ncclRecv(recv_buffer0, max_si...
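A minimal sketch of the pattern being asked about, assuming one host thread per GPU and that each thread's send and recv are fused with ncclGroupStart/ncclGroupEnd (the function name, buffer names, and error handling here are illustrative, not from the original post):

```c
/* Sketch: single process, one thread per GPU, point-to-point exchange on
 * per-device streams. Assumes comm was created with ncclCommInitAll and
 * stream with cudaStreamCreate on the matching device. Illustrative only. */
#include <cuda_runtime.h>
#include <nccl.h>

#define NCCLCHECK(cmd) do { ncclResult_t r = (cmd); \
  if (r != ncclSuccess) { /* handle error */ } } while (0)

void exchange(ncclComm_t comm, cudaStream_t stream, int peer,
              float *send_buf, float *recv_buf, size_t count) {
  /* Fusing the send and recv into one group avoids the deadlock where both
   * ranks block in ncclSend waiting for the matching ncclRecv. */
  NCCLCHECK(ncclGroupStart());
  NCCLCHECK(ncclSend(send_buf, count, ncclFloat, peer, comm, stream));
  NCCLCHECK(ncclRecv(recv_buf, count, ncclFloat, peer, comm, stream));
  NCCLCHECK(ncclGroupEnd());
  /* The calls only enqueue work: synchronize before reading recv_buf. */
  cudaStreamSynchronize(stream);
}
```

Because each thread drives its own stream, the exchanges on different devices can overlap; the group calls only guarantee that each rank's send/recv pair is launched together.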
Opened on Feb 10, 2023. We find that with 2 cards, on NCCL versions >= v2.12.7-1 the send hangs, and we can see that the program hangs inside the NCCL source code. When the NCCL version is lower than 2.12.7-1, the program runs successfully. ...
support NCCL send/recv 04dc0db — leofang added 2 commits July 7, 2020 16:48: fix tests edd...
Hi there, I'm having a problem when programming with NCCL. In fact, it is a question about the difference between NCCL_LAUNCH_MODE=GROUP and NCCL_LAUNCH_MODE=PARALLEL. The situation is: GPU0 calls ncclSend to send data to GPU1, and GPU1 calls ncclRecv to receive data from GPU0...
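For reference, the launch mode is selected through an environment variable read at initialization, so comparing the two behaviors just means rerunning the program with each value (NCCL_LAUNCH_MODE is a documented NCCL environment variable; the binary name below is a placeholder):

```shell
# Run once per mode and compare behavior. GROUP was historically used when
# one process manages several GPUs; PARALLEL launches kernels independently.
NCCL_LAUNCH_MODE=GROUP    ./my_nccl_app   # placeholder binary name
NCCL_LAUNCH_MODE=PARALLEL ./my_nccl_app
```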
Hi, recently I tried to use NCCL_MAX_NCHANNELS=10 to limit the grid size (SM count) of the nccl:all_to_all operation launched from torch/distributed/distributed_c10d.py(3881): all_to_all_single, but the result shows that the grid size is 16, which is still larger t...
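One detail that often explains symptoms like this: the channel count is fixed when the communicator is created, so the variable has to be in the environment before torch.distributed initializes NCCL, and NCCL_MIN_NCHANNELS may also need lowering. A sketch, assuming a two-GPU torchrun launch (both variables are documented NCCL environment variables; the script name is a placeholder):

```shell
# Export before the process starts so communicator init sees the values.
export NCCL_MIN_NCHANNELS=1
export NCCL_MAX_NCHANNELS=10
torchrun --nproc_per_node=2 my_all_to_all.py  # placeholder script name
```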
NCCL Blocking Send/Recv are Non-blocking in practice #42982
Hi, I have a question about how P2P send/recv tasks are scheduled into kernel plans. It seems that in scheduleP2pTasksToPlan, NCCL schedules the send/recv tasks of a group according to a sendOrder and recvOrder on which all peers have consensus, i.e., at the i-th loop, if rank r2's recvOrder[i]...
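The actual ordering logic lives in NCCL's scheduleP2pTasksToPlan; the sketch below only illustrates the consensus property with a simple round-robin schedule (my assumption for illustration, not NCCL's exact formula): at step i, rank r sends to (r + i) mod n and receives from (r - i) mod n, so every send at step i has a matching recv at the same step on the peer, and no rank waits on a peer that is busy with someone else.

```python
def p2p_orders(rank: int, nranks: int):
    """Round-robin send/recv orders with the consensus property:
    if r1's sendOrder[i] == r2, then r2's recvOrder[i] == r1."""
    send_order = [(rank + i) % nranks for i in range(nranks)]
    recv_order = [(rank - i) % nranks for i in range(nranks)]
    return send_order, recv_order

def check_consensus(nranks: int) -> bool:
    """Verify that every send at step i is matched by the peer's recv at i."""
    orders = {r: p2p_orders(r, nranks) for r in range(nranks)}
    for r1 in range(nranks):
        send_order, _ = orders[r1]
        for i, r2 in enumerate(send_order):
            if orders[r2][1][i] != r1:  # peer's recvOrder at the same step
                return False
    return True
```

With this schedule the pairing is symmetric by construction: r2's recvOrder[i] is (r2 - i) mod n, which equals r1 exactly when r1's sendOrder[i] is r2.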
Hello! I used some tracing tools to trace the all-reduce operation in NCCL and found that the execution of runRing in all_reduce.h on the GPU is always correlated with sendProxyProgress() in net.cc, which appears to run on the CPU. I wonder whether you could kindly provide me some hints about ...
I would like to improve the bus bandwidth of sendrecv_perf by adjusting NCCL_CHUNK_SIZE, but strangely the results of sendrecv_perf then come out wrong. Any hint would be highly appreciated. Thanks.
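For context, a typical way to compare a baseline run against a modified chunk size, assuming the nccl-tests binaries (the -b/-e/-f/-g flags are standard nccl-tests options; the chunk-size value is only an example, and NCCL_CHUNK_SIZE is an internal tuning variable that NCCL may clamp or ignore):

```shell
# Baseline, then with an overridden chunk size; compare busbw and the
# #wrong column, which reports data-validation failures.
./build/sendrecv_perf -b 8 -e 512M -f 2 -g 2
NCCL_CHUNK_SIZE=131072 ./build/sendrecv_perf -b 8 -e 512M -f 2 -g 2
```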