The script automatically looks at the outputs/mt_bench directory:

python3 eval_mt_bench.py --model --mode pairwise-baseline --parallel 32 --bench-name mt_bench --baseline-model archon-claude-3-5-sonnet-20240620

Alternatively, use this script to evaluate directly with a judge (no pairwise comparison). For example on Qwe...
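The two modes differ in how verdicts are aggregated: pairwise-baseline compares each answer against a fixed baseline model, while direct judging assigns a standalone score. As a rough illustration of the pairwise case (not the repository's code; the win/tie/loss labels and the half-credit-for-ties convention are assumptions), the per-question verdicts reduce to a win rate against the baseline:

```python
from collections import Counter

def pairwise_summary(verdicts):
    """Summarize pairwise-baseline judgments.

    `verdicts` is one string per question, each "win", "loss", or
    "tie" from the evaluated model's perspective against the baseline.
    Ties count as half a win, a common (assumed) convention.
    """
    counts = Counter(verdicts)
    total = len(verdicts)
    win_rate = (counts["win"] + 0.5 * counts["tie"]) / total if total else 0.0
    return {"win": counts["win"], "loss": counts["loss"],
            "tie": counts["tie"], "win_rate": win_rate}
```

For example, two wins, one tie, and one loss over four questions yields a win rate of 0.625.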
Repository layout: assets, data, docker, docs, fastchat, fschat.egg-info — all last touched by the commit "Add multilingual MT-Bench files" (Jul 5, 2024).
MT-Bench-101 (open-compass#1215) … 34bcd8f
Leymore pushed a commit to Leymore/opencompass that referenced this pull request on Jul 12, 2024: MT-Bench-101 (open-compass#1215) … adebf68
Files changed include opencompass/datasets/subjective/mtbench101.py and docs/zh_cn/advanced_guides/compassbench_intro.md.

configs/datasets/subjective/multiround/mtbench101_judge.py (62 additions, 0 deletions):
@@ -0,0 +1,62 @@ from ...
scripts/mtbench_eval.py (1 addition, 0 deletions):
@@ -273,6 +273,7 @@ def play_a_match_wrapper(match):
columns = ['basemodel_name'] + df_summary.category.values.tolist()
data = [[cfg.metainfo.basemodel_name] + df_summary.score.values.tolist()]
mtbench_df = ...
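The lines quoted in the diff build a one-row summary table whose columns are the benchmark categories and whose single row holds the model's per-category scores. A minimal sketch of that construction, with a hypothetical `df_summary` and model name standing in for the values the real script computes earlier:

```python
import pandas as pd

# Hypothetical per-category judge scores, standing in for `df_summary`.
df_summary = pd.DataFrame({
    "category": ["writing", "reasoning", "coding"],
    "score": [8.2, 6.9, 7.4],
})
basemodel_name = "my-model"  # stands in for cfg.metainfo.basemodel_name

# One column per category, plus the model-name column, exactly as in the diff.
columns = ["basemodel_name"] + df_summary.category.values.tolist()
data = [[basemodel_name] + df_summary.score.values.tolist()]
mtbench_df = pd.DataFrame(data, columns=columns)
```

The result is a wide one-row table that is convenient to append to a leaderboard file, one row per evaluated model.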
Files added: data/mt_bench_output.json, demo-pld.ipynb, prompt-lookup-decoding.ipynb. data/mt_bench_output.json (1 addition, 0 deletions) is a large diff. 511 changes: 511 additions ...
git clone --recurse-submodules https://github.com/CONE-MT/BenchMAX.git
cd BenchMAX
pip install -r requirements.txt

Evaluation: Rule-based Instruction-Following Task. We employ lm-evaluation-harness to run this task. First clone its repository and install the lm-eval package:
git clone --depth...
Ricks-Lab/benchMT: SETI multi-threaded MB/AP Benchmark Tool.
# env mt_bench
git clone https://github.com/lm-sys/FastChat.git
cd FastChat
pip install -e ".[model_worker,llm_judge]"
python gen_judgment.py --model-list gpt-3.5-turbo gpt-4 --parallel 2
python show_result.py --model-list gpt-3.5-turbo gpt-4 ...
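show_result.py aggregates the judgments that gen_judgment.py wrote to disk, reporting an average score per model. A simplified sketch of that aggregation (the one-record-per-line JSON format with "model" and "score" fields is an assumption for illustration, not FastChat's exact schema):

```python
import json
from collections import defaultdict

def average_scores(jsonl_lines):
    """Average single-judge scores per model.

    Each element of `jsonl_lines` is one JSON record with at least a
    "model" name and a numeric "score" (an assumed, simplified schema).
    """
    totals = defaultdict(lambda: [0.0, 0])  # model -> [sum, count]
    for line in jsonl_lines:
        rec = json.loads(line)
        t = totals[rec["model"]]
        t[0] += rec["score"]
        t[1] += 1
    return {model: s / n for model, (s, n) in totals.items()}
```

Averaging per model rather than per question is what makes MT-Bench scores comparable across models judged on the same question set.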