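MMLU: MMLU is an English-language benchmark for evaluating natural language understanding and one of the main datasets currently used to assess large language models. Its validation and test sets contain 1.5K and 14.1K multiple-choice questions respectively, covering 57 subjects. For the MMLU inference code, see this project: 📖GitHub Wiki Models | Valid (0-shot) | Valid (5-shot) | Test (0-shot) | Test (5-shot) Llama-3-Chinese-8B-Instruct-v3...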
2023 CMMLU: Measuring massive multitask language understanding in Chinese Haonan Li, Yixuan Zhang, Fajri Koto, Yifei Yang, Hai Zhao, Yeyun Gong, Nan Duan, Tim Baldwin 2023 Extensive Self-Contrast Enables Feedback-Free Language Model Alignment ...
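CMMLU is a comprehensive Chinese evaluation benchmark designed specifically to assess the knowledge and reasoning abilities of language models in a Chinese context. CMMLU covers 67 topics ranging from elementary subjects to advanced professional level. It includes natural sciences that require calculation and reasoning, humanities and social sciences that require knowledge, and topics such as Chinese driving rules that require everyday common sense. In addition, many tasks in CMMLU have China-specific answers that may not be universally applicable in other regions or languages...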
CMMLU: Measuring massive multitask language understanding in Chinese Haonan Li, Yixuan Zhang, Fajri Koto, Yifei Yang, Hai Zhao, Yeyun Gong, Nan Duan, Tim Baldwin 2023 Evaluating the Performance of Large Language Models on GAOKAO Benchmark ...
As the capabilities of large language models (LLMs) continue to advance, evaluating their performance becomes increasingly crucial and challenging. This paper aims to bridge this gap by introducing CMMLU, a comprehensive Chinese benchmark that covers various subjects, including natural science, social ...
For example, when the benchmarked tasks revolved around common reasoning and reading comprehension, the 01.AI model delivered scores of 80.1 and 76.4, while Llama 2 followed closely with scores of 71.9 and 69.4. Even on the MMLU (massive multitask language understanding) benchmark, the Chinese mod...
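CMMLU is another comprehensive Chinese evaluation dataset, designed specifically to assess the knowledge and reasoning abilities of language models in a Chinese context. It covers 67 topics ranging from elementary subjects to advanced professional level, with 11.5K multiple-choice questions in total. For the CMMLU inference code, see this project: 📖GitHub Wiki LLaMA Models | Test (0/few-shot) | Alpaca Models | Test (0/few-shot) Chinese-LLaMA-2-13B 38.9 / 42.5 Chinese-Alpac...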
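CMMLU: a comprehensive Chinese evaluation benchmark covering 67 topics from elementary subjects to advanced professional level, designed specifically to assess the knowledge and reasoning abilities of language models in a Chinese context. MMLU: an English evaluation dataset comprising 57 subtasks. It spans many fields, from elementary mathematics, US history, and computer science to law, with difficulty ranging from high-school to expert level. It effectively measures, across broad disciplinary categories such as the humanities, social sciences, and science and engineering, a model's overall...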
https://en.m.wikipedia.org/wiki/List_of_varieties_of_Chinese ↑The map is attached at this link. ↓The Chinese language (including its dialects) is explained in English in this video. https://youtu.be/QY0AMmLuiqk|China has a lot of dialects|There are so many
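As with the English MMLU test, our testing method does not require a large training dataset. We assume the model has already acquired the necessary knowledge by reading large amounts of diverse text from the internet, a process usually called pretraining. Humans learn new knowledge mainly by reading books, listening to lectures, and doing exercises. We therefore provide a few-shot testing mode, with a development set and a test set for each task. The development set is used for few-shot prompts, while the test set is used for...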
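The few-shot setup described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's official inference code: the `Item` class, the helper names `format_item`, `build_prompt`, and `accuracy`, and the English "Answer:" prompt layout are all assumptions made for this example. It takes up to `k` solved questions from a task's development set, appends one unsolved test question, and scores predicted letters against the test answers.

```python
from dataclasses import dataclass

CHOICES = ("A", "B", "C", "D")  # assumed four-option multiple-choice format


@dataclass
class Item:
    """One multiple-choice question (hypothetical container for this sketch)."""
    question: str
    options: tuple  # four option strings, in A-D order
    answer: str     # gold label, one of CHOICES


def format_item(item, include_answer):
    """Render a question; solved dev examples include the gold answer."""
    lines = [item.question]
    for label, option in zip(CHOICES, item.options):
        lines.append(f"{label}. {option}")
    lines.append("Answer:" + (f" {item.answer}" if include_answer else ""))
    return "\n".join(lines)


def build_prompt(dev_items, test_item, k=5):
    """Prefix up to k solved dev examples, then the unsolved test question."""
    shots = [format_item(it, True) for it in dev_items[:k]]
    return "\n\n".join(shots + [format_item(test_item, False)])


def accuracy(predictions, test_items):
    """Fraction of predicted letters that match the gold answers."""
    if not test_items:
        return 0.0
    correct = sum(p == it.answer for p, it in zip(predictions, test_items))
    return correct / len(test_items)
```

With `k=0` the prompt reduces to the bare test question, which corresponds to the zero-shot columns in the result tables. The model's completion after the final "Answer:" would then be mapped to a letter and passed to `accuracy`.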