Pixtral 12B; Pravesh Agrawal et al.
Reconstructive Visual Instruction Tuning; Haochen Wang et al.
DEEM: Diffusion Models Serve as the Eyes of Large Language Models for Image Perception; Run Luo et al.
Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models; Weixi...