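Posted to arXiv on 2023/06/23. A survey of multimodal large language models from Tencent and USTC. Alongside the paper collection, it also provides an evaluation standard for large models. GitHub - BradyFU/Awesome-Multimodal-Large-Language-Models: ✨✨Latest Papers …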
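This paper aims to track and summarize recent progress in Multimodal Large Language Models (MLLMs), covering model architecture, training strategy and data, and evaluation. The authors then introduce research topics on extending MLLMs to support more granularities, modalities, languages, and scenarios. They also discuss the hallucination problem that MLLMs face, as well as multimodal in-context learning, multimodal chain of thought, large language model...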
Vision-language pre-training; Large language model. As an emerging task bridging vision and language, Language-grounded Multimodal 3D Scene Understanding (3D-LMSU) has attracted significant interest across various domains, such as robot navigation and human–computer interaction. It aims to generate ...
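Survey 1: A Survey on Multimodal Large Language Models. Paper link: https://arxiv.org/pdf/2306.13549.pdf Project link: https:///BradyFU/Awesome-Multimodal-Large-Language-Models The paper was updated on April 1, 2024. 1. Components of a multimodal LLM. Common multimodal LLM structure: for a typical MLLM that takes multimodal input and produces text output, the architecture generally includes an encoder...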
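A Survey of Multimodal Large Language Model from A Data-centric Perspective. This paper comprehensively surveys multimodal large language models (MLLMs) from a data-centric perspective. Just as humans perceive the world through multiple senses such as sight, smell, hearing, and touch, MLLMs integrate and process data from multiple modalities, including text, vision, audio, video, and 3D environments, which enhances...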
📌 What is This Survey About? In the past year, Multimodal Large Language Models (MLLMs) have demonstrated remarkable performance in tasks such as visual question answering, visual understanding and reasoning. However, the extensive model size and high training and inference costs have hindered the...
The first survey for Multimodal Large Language Models (MLLMs). ✨ Welcome to add WeChat ID (wmd_ustc) to join our MLLM communication group! 🌟 🔥🔥🔥MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models ...
Multimodal Large Language Model. The MLLM consists of a decoder-based language model FLLM to auto-regressively generate text responses following the user's inputs, a vision encoder FV1 to extract features from the input image, and a linear projector ϕ to align ...
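The three components named in this snippet can be illustrated with a minimal sketch. It is not the paper's implementation: the snippet is cut off after "align", so the code assumes the linear projector maps image features to the language model's embedding width, and that the projected visual tokens are placed before the text embeddings. The vision encoder and the language model are not modeled; random vectors stand in for their features, and all names and sizes (linear_project, build_llm_input, VISION_DIM, LLM_DIM) are hypothetical.

```python
import random


def linear_project(features, weight, bias):
    """Apply y = W x + b to every feature vector (one vector per image patch)."""
    return [
        [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]
        for vec in features
    ]


def build_llm_input(image_features, text_embeddings, weight, bias):
    """Project image features to the LLM embedding width, then prepend them to the text.

    Placing visual tokens before the text tokens is an assumption made for this sketch.
    """
    visual_tokens = linear_project(image_features, weight, bias)
    return visual_tokens + text_embeddings


# Hypothetical sizes: 4 image patches, vision width 6, LLM embedding width 3, 2 text tokens.
rng = random.Random(0)
VISION_DIM, LLM_DIM = 6, 3

# Random stand-ins for the vision encoder output and the text token embeddings.
image_features = [[rng.uniform(-1, 1) for _ in range(VISION_DIM)] for _ in range(4)]
text_embeddings = [[rng.uniform(-1, 1) for _ in range(LLM_DIM)] for _ in range(2)]

# Projector parameters: an LLM_DIM x VISION_DIM weight matrix and a zero bias.
weight = [[rng.uniform(-1, 1) for _ in range(VISION_DIM)] for _ in range(LLM_DIM)]
bias = [0.0] * LLM_DIM

sequence = build_llm_input(image_features, text_embeddings, weight, bias)
print(len(sequence), len(sequence[0]))  # prints "6 3"
```

The resulting sequence has one entry per image patch plus one per text token, all at the same width, which is what a decoder-based language model would need in order to attend over both modalities together.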
Title: A Survey on Multimodal Large Language Models. Authors: Shukang Yin1*, Chaoyou Fu2∗‡†, Sirui Zhao1∗‡, Ke Li2, Xing Sun2, Tong Xu1, Enhong Chen1‡. Affiliations: School of CST., USTC & State Key Laboratory of Cognitive Intelligence 2Tencent YouTu Lab ...
Multimodal large language models (MLLMs), in particular, have emerged as a powerful framework, demonstrating impressive capabilities in tasks like image-text generation, visual question answering, and cross-modal retrieval. Despite these advancements, the complexity and scale of MLLMs introduce ...