...Parallelization Layouts for Large-Scale Distributed Model...
many of these strategies interact in complex ways that affect final training efficiency. Prior work tackling this problem did not have access to the latest set of optimizations, such as FlashAttention or seq