Corresponding leaderboard: MMLU Pro - a Hugging Face Space by TIGER-Lab

II. Abstract

Throughout the development of LLMs, benchmarks such as MMLU have played a key role in driving progress in AI language understanding and reasoning across diverse domains. However, as models have continued to improve, performance on these benchmarks has begun to plateau, making it increasingly difficult to discern differences in capability between models. The authors therefore created MMLU-Pro, which is a...
MMLU-Pro | 🤗 Dataset | 🏆 Leaderboard | 📖 Paper |

This repo contains the evaluation code for the NeurIPS-24 paper "MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark"

Introduction

We introduce MMLU-Pro, an enhanced benchmark designed to evaluate language und...
- **Tasks for instruct models**: Math-Hard, IFeval, GPQA, and MMLU-Pro. These tasks are common evaluations, many of which overlap with the Hugging Face [Open LLM Leaderboard v2](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard). Here, we aim to get the benchmark ...