DeepSpeed does not implement model parallelism itself but is compatible with existing forms such as tensor slicing and pipeline parallelism. However, ZeRO Stage 3 should reduce per-GPU memory consumption of model parameters and optimizer state. Offloading should also reduce GPU memory consumption by moving ...
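As a rough illustration, ZeRO Stage 3 and offloading are enabled through the DeepSpeed JSON config. The sketch below shows a minimal config as a Python dict; the batch-size and precision values are illustrative assumptions, not recommendations.

```python
# Minimal sketch of a DeepSpeed config enabling ZeRO Stage 3 with CPU
# offloading of optimizer state and parameters. Values below are
# assumptions; tune them for your model and hardware.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,   # assumption: adjust to your workload
    "bf16": {"enabled": True},             # assumption: hardware supports bf16
    "zero_optimization": {
        "stage": 3,                              # partition model/optimizer state across GPUs
        "offload_optimizer": {"device": "cpu"},  # keep optimizer state in CPU RAM
        "offload_param": {"device": "cpu"},      # keep parameters in CPU RAM
    },
}
```

Such a dict can be passed to `deepspeed.initialize(...)` via its `config` argument, or saved as JSON and referenced on the command line.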