Dataset: TIGER-Lab/MMLU-Pro · Datasets at Hugging Face. Leaderboard: MMLU-Pro - a Hugging Face Space by TIGER-Lab. II. Abstract: Over the course of LLM development, benchmarks such as MMLU have played a key role in advancing AI's language understanding and reasoning across domains. However, as models continue to improve, performance on these benchmarks has begun to plateau, making it increasingly difficult to discern differences between ...
Researchers from the University of Waterloo, the University of Toronto, and Carnegie Mellon University propose a new benchmark and leaderboard, MMLU-Pro, which addresses these limitations by incorporating more challenging, reasoning-intensive tasks and increasing the number of answer options from four to ten.
|🤗 Dataset | 🏆Leaderboard | 📖 Paper | This repo contains the evaluation code for the NeurIPS-24 paper "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark" Introduction We introduce MMLU-Pro, an enhanced benchmark designed to evaluate language understanding ...
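One distinguishing feature of MMLU-Pro is the expanded ten-option answer format (A–J). As a rough illustration, here is a minimal sketch of how such a question might be rendered into a chain-of-thought prompt; the field names and the exact prompt wording are assumptions for illustration, not the repository's actual template.

```python
# Sketch of formatting an MMLU-Pro-style question with up to ten
# lettered options (A-J), ending with a chain-of-thought cue.
# The question text and options below are invented examples.

OPTION_LETTERS = "ABCDEFGHIJ"

def format_question(question: str, options: list[str]) -> str:
    """Render a question with lettered options and a CoT answer cue."""
    lines = [question]
    for letter, option in zip(OPTION_LETTERS, options):
        lines.append(f"{letter}. {option}")
    lines.append("Answer: Let's think step by step.")
    return "\n".join(lines)

prompt = format_question(
    "Which gas makes up most of Earth's atmosphere?",
    ["Oxygen", "Carbon dioxide", "Nitrogen", "Argon", "Helium",
     "Hydrogen", "Methane", "Ozone", "Neon", "Water vapor"],
)
print(prompt)
```

With ten options instead of four, the chance of guessing correctly drops from 25% to 10%, which is part of what makes the benchmark more discriminative.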
To align with the paper and leaderboard, use evaluate_from_local.py for open-source models and evaluate_from_api.py for proprietary models. chigkim (Contributor, Author) commented on Jul 9, 2024: Thank you for the clarification! I'm hoping you could help me with one more question. The script eva...