MAP-Neo and MMLU-Pro share some of the same authors.
Paper: [2406.01574] MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
Dataset: TIGER-Lab/MMLU-Pro · Datasets at Hugging Face
Leaderboard: MMLU Pro - a Hugging Face Space by TIGER-Lab
The dataset contains 12K complex, cross-disciplinary questions. It was released in 2024 by researchers from the University of Waterloo, the University of Toronto, and Carnegie Mellon University; the accompanying paper is "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark". Questions and options: each question in the dataset typically has 10 multiple-choice options, although during manual review some...
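Since the dataset is hosted on the Hugging Face Hub, the quickest way to see the question/option structure is to load it with the `datasets` library. This is a minimal sketch, assuming the current dataset card's split and field names (`question`, `options`, `answer`, `category`); re-check them against the card before relying on it.

```python
# Minimal sketch: load MMLU-Pro from the Hugging Face Hub and inspect one
# record. Field names follow the dataset card and may change over time.
from datasets import load_dataset

ds = load_dataset("TIGER-Lab/MMLU-Pro", split="test")
print(len(ds))  # on the order of 12K questions

ex = ds[0]
print(ex["category"])   # subject area, e.g. "math"
print(ex["question"])   # the question stem
for i, opt in enumerate(ex["options"]):  # usually 10 options, fewer where review removed some
    print(f"({chr(ord('A') + i)}) {opt}")
print(ex["answer"])     # gold answer letter, e.g. "B"
```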
II. Abstract
In the age of large-scale language models, benchmarks like the Massive Multitask Language Understanding (MMLU) have been pivotal in pushing the boundaries of what AI can achieve in language comprehension and reasoning across diverse domains. However, as models continue to improve, their performance... MMLU-Pro, which addresses these limitations by incorporating more challenging, reasoning-intensive tasks and increasing the number of distractor options from three to nine. This benchmark spans 14 diverse domains, encompassing over 12,000 questions, thu...
The code and data for "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark" [NeurIPS 2024] - TIGER-AI-Lab/MMLU-Pro
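The repo's evaluation harness prompts models with few-shot chain-of-thought and then extracts the final option letter from the completion. Below is a hedged, minimal sketch of that scoring pattern; the `generate` callable is a hypothetical stand-in for an actual LLM client, and the regexes approximate rather than reproduce the official scripts.

```python
# Sketch of an MMLU-Pro-style scoring loop: ask the model for chain-of-thought,
# then pull the final answer letter (A-J, since there are up to 10 options)
# out of the completion with a regex.
import re
from typing import Callable, Iterable, Optional

def extract_choice(completion: str) -> Optional[str]:
    """Return the predicted option letter, or None if no letter is found."""
    m = re.search(r"answer is \(?([A-J])\)?", completion)
    if m:
        return m.group(1)
    # Fallback: take the last parenthesized letter anywhere in the output.
    letters = re.findall(r"\(([A-J])\)", completion)
    return letters[-1] if letters else None

def accuracy(examples: Iterable[dict], generate: Callable[[dict], str]) -> float:
    """Exact-match accuracy of extracted letters against gold `answer` fields."""
    total = correct = 0
    for ex in examples:
        pred = extract_choice(generate(ex))  # model's CoT completion for this record
        correct += int(pred == ex["answer"])
        total += 1
    return correct / max(total, 1)
```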
A follow-up benchmark, MMLU-Pro+, builds on this work. From its abstract: Existing benchmarks for large language models (LLMs) increasingly struggle to differentiate between top-performing models, underscoring the need for more challenging evaluation frameworks. We introduce MMLU-Pro+, an enhanced benchmark building upon MMLU-Pro to assess shortcut learning and higher-order...