Add allReduce operator and CUDA NCCL allReduce kernel; implement model parallelism for ResNet; add allGather NCCL kernel and operator; add allReduce and allGather operator tests; change allGather kernel to output a list of tensors; fix shape inference; handle nullptr output; fix formatting of onnx.py; use concat ...