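\label {eq:vid} (3)

Overall, the full objective is $\mathcal{L} = \lambda_{\mathrm{MSM}}\mathcal{L}_{\mathrm{MSM}} + \lambda_{\mathrm{REL}}\mathcal{L}_{\mathrm{REL}} + \lambda_{\mathrm{VID}}\mathcal{L}_{\mathrm{VID}}$, where the weights $\lambda_{\mathrm{MSM}}$, $\lambda_{\mathrm{REL}}$, and $\lambda_{\mathrm{VID}}$ balance the three losses.

3.3. Improved Mask-Predict for Video Generation

We employ mask-predict [23] during inference, which iteratively remasks and repredicts low...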