The MMLU metric

Paper: Measuring Massive Multitask Language Understanding

This is probably the most controversial metric on the HF LLM leaderboard. MMLU is essentially a multiple-choice QA task spanning many domains, including humanities, STEM, mathematics, US history, computer science, and law. The full dataset can be browsed on Hugging Face.

Running the evaluation

In lm-evaluation-harness...
```
    arc_challenge \
  --batch_size auto \
  --output_path ./eval_out/openbuddy13b \
  --use_cache ./eval_cache

# Use the accelerate launcher, which supports multi-GPU
accelerate launch -m lm_eval --model hf \
  --model_args pretrained=./openbuddy-llama2-13b-v11.1-bf16 \
  --tasks mmlu,nq_open,triviaqa,truthfulqa...
```
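If you only want the MMLU suite on a single GPU, a minimal sketch reusing the flags above might look like this (the local model path and output directory are placeholders, not from the original text):

```
# Single-GPU run restricted to the MMLU task group
lm_eval --model hf \
  --model_args pretrained=./openbuddy-llama2-13b-v11.1-bf16 \
  --tasks mmlu \
  --device cuda:0 \
  --batch_size auto \
  --output_path ./eval_out/mmlu_only
```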
Use this flag to supply the arguments that initialize the wandb run (wandb.init), given as a comma-separated string of key=value pairs:

```
lm_eval --model hf \
  --model_args pretrained=microsoft/phi-2,trust_remote_code=True \
  --tasks hellaswag,mmlu_abstract_algebra \
  --device cuda:0 \
  --batch_size 8 \
  --output_path output/phi-2 \
  --limit 10 \
  --wandb_args project=lm-eval-harness-integration \
  --log_samples
```

In the stdout, you will find ...
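Because the string is split on commas into wandb.init keyword arguments, several settings can be combined in one flag. A sketch (the run name here is made up for illustration):

```
# Pass multiple wandb.init kwargs in one comma-separated string
lm_eval --model hf \
  --model_args pretrained=microsoft/phi-2,trust_remote_code=True \
  --tasks hellaswag \
  --batch_size 8 \
  --wandb_args project=lm-eval-harness-integration,name=phi-2-smoke-test
```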
For fastest performance, we recommend using `--batch_size auto` for vLLM whenever possible, to leverage its continuous batching functionality!

Tip: Passing `max_model_len=4096` or some other reasonable default to vLLM through model args may cause speedups or prevent out-of-memory errors when trying to use...
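Combining both tips, a vLLM run might look like the following sketch (the model name and the gpu_memory_utilization value are assumptions added for illustration; max_model_len is forwarded to the vLLM engine through --model_args as the tip describes):

```
# vLLM backend with continuous batching and a capped context length
lm_eval --model vllm \
  --model_args pretrained=meta-llama/Llama-2-13b-hf,dtype=auto,max_model_len=4096,gpu_memory_utilization=0.8 \
  --tasks mmlu \
  --batch_size auto
```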
lm-evaluation-harness / pyproject.toml (as of v0.4.4, commit 543617f, Sep 5, 2024):

```
[build-system]
requires = ["setuptools>=40.8...
```
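For reference, the harness is typically installed from source as described in the project README:

```
# Clone the repository and install it in editable mode
git clone https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .
```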
- Arabic MMLU and aEXAMS by @khalil-hennara
- And more!
- Re-introduction of the `TemplateLM` base class for lower-code new LM class implementations by @anjor
- Run the library with the metrics/scoring stage skipped via `--predict_only` by @baberabb
- Many more miscellaneous improvements by a lot of great contributors!
mmlu model_written_evals mutual noticia nq_open okapi openbookqa paws-x pile pile_10k piqa polemo2 prost pubmedqa qa4mre qasper race realtoxicityprompts sciq scrolls siqa squad_completion squadv2 storycloze super_glue swag swde tinyBenchmarks tmmluplus toxigen translation triviaqa truthfulqa...
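The full, up-to-date list of task names for your installed version can be printed with the harness's own listing command:

```
# Print every registered task and task group
lm_eval --tasks list
```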