The rapid evolution of Large Language Models (LLMs) highlights the necessity for ethical considerations and data integrity in AI development, particularly emphasizing the role of FAIR (Findable, Accessible, Interoperable, Reusable) data principles. While these principles are crucial for ethical data ...
An open, large-scale biomolecular instruction dataset for large language models. 1. Overview 1.1 Data Stats Mol-Instructions comprises three cardinal components: 🔬 Molecule-oriented instructions: This component delves into the world of small molecules...
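I. Tri-modal (text, image, and speech) 1. "How2: A Large-scale Dataset for Multimodal Language Understanding" -- [multimodal automatic speech recognition, multimodal machine translation, speech-to-text translation, multimodal summarization] How2 is a large-scale multimodal dataset of instructional videos on a wide range of topics, covering 80,000 video clips (about 2,000 hours), with word-level ...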
Mol-Instructions | 48K | English | An open, large-scale biomolecular instruction dataset for large language models. | CC BY 4.0
RefGPT | 50K | English, Chinese | We introduce a cost-effective method called RefGPT, which generates a vast amount of high-quality multi-turn Q&A content. | - ...
MSR-VTT: A Large Video Description Dataset for Bridging Video and Language. Jun Xu, Tao Mei, Ting Yao, Yong Rui. June 2016. Published by IEEE International Conference on Computer Vision and Pattern Recognition (CVPR). While there has been increasing interest in...
Cloze tests are widely adopted in language exams to evaluate students' language proficiency. In this paper, we propose the first large-scale human-created cloze test dataset CLOTH, containing questions used in middle-school and high-school language exams. With missing blanks carefully created by tea...
In this paper, we introduce the Chinese AI and Law challenge dataset (CAIL2018), the first large-scale Chinese legal dataset for judgment prediction. CAIL2018 contains more than 2.6 million criminal cases published by the Supreme People's Court of China, which are several times larger than...
then in MEG, which are well suited to studying the dynamic processing of language comprehension, said the research article. In addition, the dataset, comprising a large vocabulary from stories with various topics, can serve as a brain benchmark to evaluate and improve computational language models...
LLaVA-Bench is a dataset created to evaluate the capability of large multimodal models (LMM) in more challenging tasks and generalizability to novel domains. It consists of a diverse set of 24 images with 60 questions in total, including indoor and outdo