Our results strongly advocate for revising MMLU's error-ridden questions to enhance its future utility and reliability as a benchmark.
Dataset: https://huggingface.co/datasets/edinburgh-dawg/mmlu-redux-2.0
Code: aryopg/mmlu-redux (official), yuchenlin/zeroeval