MMT-Bench is a comprehensive benchmark designed to evaluate Large Vision-Language Models (LVLMs) across a wide array of multimodal tasks that require expert knowledge as well as deliberate visual recognition, localization, reasoning, and planning¹. It includes 31,325 meticulously curated multi-choice visual questions.
The results show that MMT-Bench poses significant challenges to existing LVLMs: even advanced models such as InternVL-Chat, GPT-4o, and GeminiProVision achieve accuracies of only 63.4%, 65.5%, and 61.6%, respectively. Overall, the closed-source proprietary model GPT-4o currently leads on MMT-Bench, surpassing InternVL-Chat, QWen-VL-Plus, GPT-4V, GeminiProVision, and other models. Notably, ...
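The accuracy figures above come from scoring model choices against gold answers on multi-choice questions. A minimal sketch of that scoring step is below; the field names (`id`, `answer`) are illustrative assumptions, not MMT-Bench's actual data schema.

```python
# Hypothetical multi-choice accuracy scoring; field names are
# illustrative, not MMT-Bench's actual schema.
def accuracy(items, predictions):
    """items: list of dicts with 'id' and gold 'answer' (e.g. 'A'-'D');
    predictions: dict mapping item id -> predicted choice letter."""
    if not items:
        return 0.0
    correct = sum(
        1 for item in items
        if predictions.get(item["id"]) == item["answer"]
    )
    return correct / len(items)

items = [
    {"id": 1, "answer": "A"},
    {"id": 2, "answer": "C"},
    {"id": 3, "answer": "B"},
]
preds = {1: "A", 2: "C", 3: "D"}
print(f"{accuracy(items, preds):.1%}")  # two of three correct
```

Reported leaderboard accuracies are typically this fraction averaged over the benchmark's questions, optionally broken down per task.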
MMT-Bench: A Multimodal MultiTask Benchmark for Comprehensive Evaluation of Large Vision-Language Models Kaining Ying*, Fanqing Meng*, Jin Wang*, Zhiqian Li, Han Lin, Yue Yang, Hao Zhang, Wenbo Zhang, Yuqi Lin, Shuo Liu, Jiayi Lei, Quanfeng Lu, Peng Gao, Runjian Chen, Peng Xu, Renr...
Evaluation results involving 30 LVLMs, such as the proprietary GPT-4V and GeminiProVision and the open-source InternVL-Chat, underscore the significant challenges posed by MMT-Bench. We anticipate that MMT-Bench will inspire the community to develop next-generation multimodal foundation models aimed at...