Since its release in mid-2021, the HumanEval benchmark has become one of the most widely used evaluation tools for measuring the performance of LLMs on code generation tasks. The [leaderboard](https://paperswithcode.com/sota/code-generation-on-humaneval) on Papers with Code tracks state-of-the-art results on the benchmark.
| Benchmark | Release date | Task type | # Problems | Programming languages covered | Natural language(s) | Evaluation method | SOTA score (as of 2024-12) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| CodeArena | 2024-12 | Code Q&A (mostly code generation tasks) | 397 | 44 | 2 (Chinese, English) | LLM-as-judge | 93.3/4.4 (o1-mini) |
| SWE-bench | 2023-10 | Code repair | 2,294 | 1 (Python) | Sourced from GitHub repositories (presumably all English, unverified) | Sandbox execution | 29.38% (✅ OpenHands + Code…) |
RoboCodeGen: we introduce a new benchmark with 37 function generation problems with several key differences from previous code-gen benchmarks: (i) it is robotics-themed with questions on spatial reasoning (e.g., find the closest point to a set of points), geometric reasoning (e.g., check...
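To make the flavor of these problems concrete, here is a hypothetical spatial-reasoning function of the kind described (finding the closest point to a set of points); it is an illustration only, not an actual RoboCodeGen item:

```python
import numpy as np

def closest_point(query: np.ndarray, points: np.ndarray) -> np.ndarray:
    """Return the point in `points` nearest to `query` (Euclidean distance)."""
    dists = np.linalg.norm(points - query, axis=1)
    return points[np.argmin(dists)]

# Example: which of three 2-D points is nearest to the origin?
pts = np.array([[1.0, 2.0], [0.5, 0.5], [3.0, -1.0]])
print(closest_point(np.array([0.0, 0.0]), pts))  # -> [0.5 0.5]
```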
HumanEval is a benchmark dataset developed by OpenAI that evaluates the performance of large language models (LLMs) in code generation tasks. It has become a significant tool for assessing the capabilities of AI models in understanding and generating code. In this tutorial, we will learn about ...
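As a minimal sketch of what a HumanEval problem looks like, the snippet below loads the benchmark through the Hugging Face `datasets` library (assuming the commonly used `openai_humaneval` mirror on the Hub) and prints the fields of one task:

```python
# Inspect one HumanEval problem via the Hugging Face `datasets` library.
from datasets import load_dataset

humaneval = load_dataset("openai_humaneval", split="test")  # 164 problems
problem = humaneval[0]

print(problem["task_id"])      # e.g. "HumanEval/0"
print(problem["prompt"])       # function signature + docstring given to the model
print(problem["entry_point"])  # name of the function the tests will call
print(problem["test"])         # hidden unit tests used to check the completion
```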
Related CodeFuse projects include CodeFuse-MFTCoder (multitask fine-tuned code LLMs), codefuse-devops-eval (a DevOps domain knowledge evaluation benchmark for large language models), and codefuse-chatbot (an open-source AI assistant designed for the full software development lifecycle, covering design, coding, testing, deployment, and operations).
To navigate LLM code generation, developers rely on a suite of benchmark metrics to evaluate the performance and capabilities of Code LLMs across diverse programming challenges. This section delves into the pivotal benchmarks shaping the assessment of Code LLMs, underscoring their significance in refining...
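The headline metric on HumanEval-style benchmarks is pass@k: the probability that at least one of k sampled completions passes the unit tests. A short sketch of the unbiased estimator published with HumanEval (n samples per problem, c of which pass):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).
    n = samples generated per problem, c = samples that pass the tests."""
    if n - c < k:
        return 1.0
    # Computed as a running product for numerical stability.
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples for a problem, 30 pass -> estimated pass@10.
print(round(pass_at_k(n=200, c=30, k=10), 3))
```

The per-problem estimates are then averaged over the benchmark to report a single pass@k score.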
while the generative model integrates this data to create coherent and functional outputs. Additionally, incorporating automated testing and validation mechanisms during the generation process further enhances reliability, ensuring that the generated code meets quality benchmarks and integrates seamlessly into ...
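As a rough sketch of that validation idea, assuming a HumanEval-style problem format (a `prompt`, a `test` block defining `check(...)`, and an `entry_point`), a harness might accept a completion only if it passes the unit tests in a separate, time-limited process. This is illustrative only; a production harness needs real sandboxing and resource isolation.

```python
import subprocess
import sys
import tempfile

def passes_tests(prompt: str, completion: str, test_code: str,
                 entry_point: str, timeout: float = 5.0) -> bool:
    """Run a candidate completion against its unit tests in a subprocess."""
    program = "\n".join([
        prompt + completion,      # the candidate function
        test_code,                # the benchmark's `check` function
        f"check({entry_point})",  # run the tests; non-zero exit on failure
    ])
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                timeout=timeout, capture_output=True)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
```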
In this work, we make the first attempt to evaluate LLMs in a more challenging code generation scenario, i.e. class-level code generation. We first manually construct the first class-level code generation benchmark ClassEval of 100 class-level Python code generation tasks with approximately 500...
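For intuition, class-level generation differs from function-level generation in that the model must implement several interdependent methods that share state. A simplified, hypothetical skeleton in the spirit of ClassEval (not an actual benchmark item) might look like this, with the model asked to fill in every method body:

```python
class ShoppingCart:
    """Maintains a mapping of item name -> (unit_price, quantity)."""

    def __init__(self):
        self.items = {}

    def add_item(self, name: str, unit_price: float, quantity: int = 1) -> None:
        """Add `quantity` units of `name`; quantities accumulate, price is updated."""
        _, count = self.items.get(name, (unit_price, 0))
        self.items[name] = (unit_price, count + quantity)

    def remove_item(self, name: str) -> bool:
        """Remove an item entirely; return True if it was present."""
        return self.items.pop(name, None) is not None

    def total(self) -> float:
        """Return the total cost of all items in the cart."""
        return sum(price * count for price, count in self.items.values())
```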
Recently, there has been growing interest in how to construct better code instruction tuning data. However, we observe that code models trained with these datasets exhibit high performance on HumanEval but perform worse on other benchmarks such as LiveCodeBench. Upon further investigation, we...