Speaker: Haohan Wang (University of Illinois Urbana-Champaign). Time: January 3, 2024 (Wednesday), 20:30 (Beijing time). Title: Guardian of Trust in Language Models: Automatic Jailbreak and Systematic Defense. Speaker bio: Haohan Wang is an ...
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | arXiv 2023-11-28 | Github (coming soon)
LLMGA: Multimodal Large Language Model based Generation Assistant | arXiv 2023-11-27 | Github | Demo
ChartLlama: A Multimodal LLM for Chart Understanding and Generation | arXiv 2023-11-27 | Github
...
Fig. 2: The Cartesian coordinate system used to present language in 3D. The parameters of a transcript are mapped onto the x-, y-, and z-axes. The 3D models emerged from a translation of linguistic data into a visual programming language (Grasshopper integrated with Rhino's 3D ...
A unified model for 3D vision-language (3D-VL) understanding is expected to take various scene representations and perform a wide range of tasks in a 3D scene. However, a considerable gap exists between existing methods and such a unified model, due to the independent application of ...
Resource-hungry video understanding is difficult for three main reasons: (1) compared with image-text pairs, video-text pairs are much harder to collect and carry a risk of misalignment; (2) processing video requires far more computation; (3) video carries temporal information, so an intuitive approach is to add a temporal processing module to a strong image-language (I-VL) model and adapt it to video understanding tasks (see the sketch below).
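To make the third point concrete, here is a minimal PyTorch sketch of that idea: keep a pretrained I-VL backbone frozen, extract per-frame features, and train only a small temporal module on top. The `TemporalAdapter` name, shapes, and hyperparameters are illustrative assumptions, not any specific paper's design.

```python
import torch
import torch.nn as nn

class TemporalAdapter(nn.Module):
    """Small trainable module that adds temporal modeling on top of
    frozen per-frame features from an image-language (I-VL) backbone."""
    def __init__(self, dim: int = 512, num_layers: int = 2, num_heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, dim), e.g. CLIP image features
        # extracted frame by frame with the backbone frozen.
        out = self.encoder(frame_feats)  # attend across the time axis
        return out.mean(dim=1)           # pool to one video-level embedding

# Only the adapter's parameters are trained, which also eases reasons (1)
# and (2): no large video-text corpus and far less compute are needed.
video_embedding = TemporalAdapter()(torch.randn(4, 16, 512))  # -> (4, 512)
```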
To this end, we propose a novel approach called Differentiable rendering-based multi-view Image–Language Fusion (DILF) for zero-shot 3D shape understanding. Specifically, DILF leverages large language models (LLMs) to generate textual prompts enriched with 3D semantics and designs a ...
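For orientation, below is a hedged sketch of the generic multi-view zero-shot pipeline that DILF builds on: score pre-rendered views of a shape against class prompts with CLIP and fuse the per-view scores. The plain averaging stands in for DILF's differentiable rendering and learned fusion, which are not reproduced here; the file paths and prompt strings are placeholders.

```python
import torch
import clip  # OpenAI CLIP: https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Assume `views` holds pre-rendered images of one shape (placeholder paths).
views = [Image.open(f"view_{i}.png") for i in range(6)]
prompts = ["a 3D render of a chair", "a 3D render of a table"]  # invented examples

with torch.no_grad():
    text = model.encode_text(clip.tokenize(prompts).to(device))
    text = text / text.norm(dim=-1, keepdim=True)
    images = torch.stack([preprocess(v) for v in views]).to(device)
    img = model.encode_image(images)
    img = img / img.norm(dim=-1, keepdim=True)
    logits = img @ text.T          # (num_views, num_classes) similarities
    scores = logits.mean(dim=0)    # naive fusion: average over views

print("predicted class:", prompts[scores.argmax().item()])
```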
ProteinLMDataset / ProteinLMBench | 2024.06 | A Fine-tuning Dataset and Benchmark for Large Language Models for Protein Understanding | arXiv
DUD-E | 2012.06 | Directory of Useful Decoys, Enhanced (DUD-E): Better Ligands and Decoys for Better Benchmarking | Journal of Medicinal Chemistry
...
As the key component in multimodal large language models (MLLMs), the capability of the visual encoder greatly affects an MLLM's understanding of diverse image content. Although some large-scale pretrained vision encoders, such as those in CLIP and DINOv2, have brought promising performance, we...
Pre-training is a computationally intensive and expensive process. While it's not the focus of this course, it's important to have a solid understanding of how models are pre-trained, especially in terms of data and parameters. Pre-training can also be performed by hobbyists at a small scale with <1B models.
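As one illustration of that small-scale setting, the sketch below trains a randomly initialized GPT-2-sized model (~124M parameters, well under 1B) from scratch with Hugging Face transformers. The dataset, sequence length, and hyperparameters are illustrative assumptions, not course recommendations.

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, GPT2Config, GPT2LMHeadModel,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# Random initialization: this is pre-training, not fine-tuning.
model = GPT2LMHeadModel(GPT2Config(vocab_size=tokenizer.vocab_size))

raw = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"],
).filter(lambda ex: len(ex["input_ids"]) > 0)  # drop empty lines

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="pretrain-ckpt",
                           per_device_train_batch_size=8,
                           num_train_epochs=1),
    train_dataset=tokenized,
    # mlm=False gives causal (next-token) language modeling labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```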