Since its inception in mid-2021, the HumanEval benchmark has not only become immensely popular but has also emerged as a quintessential evaluation tool for measuring the performance of LLMs on code generation tasks.
| Benchmark | Released | Task type | # Problems | Programming languages | Natural languages | Evaluation | SOTA (as of 2024-12) |
|---|---|---|---|---|---|---|---|
| CodeArena | 2024-12 | code Q&A (generation-oriented) | 397 | 44 | 2 (Chinese, English) | LLM-as-judge | 93.3/4.4 (o1-mini) |
| SWE-bench | 2023-10 | code repair | 2,294 | 1 (Python) | from GitHub repositories (presumably English, unverified) | sandboxed execution | 29.38% (✅ OpenHands + Code…) |
HumanEval is a benchmark dataset developed by OpenAI that evaluates the performance of large language models (LLMs) on code generation tasks. It has become a standard tool for assessing how well AI models understand and generate code.
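HumanEval is typically scored with the pass@k metric introduced alongside it in the Codex paper: generate n samples per problem, count the c samples that pass all unit tests, and estimate the probability that at least one of k randomly drawn samples is correct. A minimal sketch of the unbiased estimator:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the Codex paper.

    n: total samples generated for a problem
    c: samples that pass all unit tests
    k: number of samples hypothetically drawn
    """
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k
        # samples must contain at least one correct solution.
        return 1.0
    # P(all k drawn samples are incorrect) = C(n-c, k) / C(n, k)
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Example: 200 samples per problem, 42 passing -> estimate pass@10
print(round(pass_at_k(200, 42, 10), 4))
```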
EvoCodeBench: An Evolving Code Generation Benchmark Aligned with Real-World Code Repositories (paper notes). Real-world repositories: a benchmark should be collected from real-world code repositories. Real code distribution: real-world repositories contain two kinds of code, standalone and non-standalone. As shown in Figure 1, standalone functions rely only on built-in or standard-library features, whereas non-standalone code depends on context defined elsewhere in the repository.
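To make the standalone/non-standalone distinction concrete, here is a small illustrative sketch (not an actual EvoCodeBench item): the last function can only be generated correctly if the model resolves a repository-local dependency.

```python
# --- utils.py (imagined as living elsewhere in the repository) ---
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace."""
    return " ".join(text.lower().split())

# --- Standalone function: uses only built-ins, testable in isolation ---
def word_count(text: str) -> dict:
    counts = {}
    for word in text.split():
        counts[word] = counts.get(word, 0) + 1
    return counts

# --- Non-standalone function: depends on `normalize` from utils.py,
# so a model must understand repository context to generate it ---
def normalized_word_count(text: str) -> dict:
    return word_count(normalize(text))

assert normalized_word_count("The  the THE") == {"the": 3}
```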
Large language models (LLMs), such as GPT, Claude, Llama, and Mistral, have emerged as promising tools to assist in deep learning (DL) code generation, offering potential solutions to these challenges. Despite this, existing benchmarks such as DS-1000 are limited, as they primarily focus on small DL code snippets rather than complete programs.
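For illustration, a hypothetical DS-1000-style problem: the model is given setup code and asked to fill in a single short snippet, which shows why such items remain far smaller than full DL programs. The task and the BEGIN/END layout below are assumptions in the style of the benchmark, not an actual item.

```python
import numpy as np

# Setup code provided to the model.
a = np.array([[1, 2], [3, 4], [5, 6]])

# Problem: compute the row-wise softmax of `a`.
# --- BEGIN SOLUTION (the only part the model must produce) ---
result = np.exp(a) / np.exp(a).sum(axis=1, keepdims=True)
# --- END SOLUTION ---

# Hidden test: each row of a softmax sums to 1.
assert np.allclose(result.sum(axis=1), 1.0)
```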
CodeFuseEval is a code generation benchmark that combines the multi-task scenarios of the CodeFuse model with the HumanEval-X and MBPP benchmarks.
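As a rough sketch of how an execution-based harness like this might check a single Python task: concatenate the prompt, the model completion, and the tests, then run the result in a subprocess. The field names (`prompt`, `completion`, `test`) follow the HumanEval convention but are assumptions here, not CodeFuseEval's actual API.

```python
import os
import subprocess
import tempfile

def run_python_task(record: dict, timeout: int = 10) -> bool:
    """Execute prompt + completion + tests; pass iff exit code is 0."""
    program = record["prompt"] + record["completion"] + "\n" + record["test"]
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        proc = subprocess.run(["python", path],
                              capture_output=True, timeout=timeout)
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False  # hung or too slow counts as a failure
    finally:
        os.remove(path)
```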
Related CodeFuse projects:
- CodeFuse-MFTCoder: multitask fine-tuned code LLMs.
- codefuse-devops-eval: a DevOps domain-knowledge evaluation benchmark for large language models.
- codefuse-chatbot: an open-source AI assistant designed for the full software development lifecycle, covering design, coding, testing, deployment, and operations.
We evaluate LLMs on class-level code generation. Based on our results, we have the following main findings. First, all existing LLMs perform much worse on class-level code generation than on standalone method-level benchmarks such as HumanEval, and strong method-level performance does not reliably carry over to class-level tasks.
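A hypothetical class-level task in the spirit of this evaluation: unlike a HumanEval-style standalone function, the methods below share state and call one another, which is exactly what method-level benchmarks do not exercise.

```python
class ShoppingCart:
    """Maintains items and computes totals with a discount."""

    def __init__(self):
        self.items = {}  # name -> (unit_price, quantity)

    def add_item(self, name: str, price: float, qty: int = 1) -> None:
        _, old_qty = self.items.get(name, (price, 0))
        self.items[name] = (price, old_qty + qty)

    def subtotal(self) -> float:
        return sum(price * qty for price, qty in self.items.values())

    def total(self, discount: float = 0.0) -> float:
        # Depends on subtotal() and on state set by add_item():
        # generating this correctly requires class-level context.
        return self.subtotal() * (1.0 - discount)

cart = ShoppingCart()
cart.add_item("pen", 2.0, 3)
assert abs(cart.total(0.1) - 5.4) < 1e-9
```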
We chose non-trivial kernels from the non-serial polyadic dynamic programming (NPDP) benchmark with non-uniform loops. The focus is on ensuring code validity and understanding the limitations in obtaining valid CUDA code. Additionally, we assess the efficiency and scalability of the generated code on an NVIDIA A100 GPU.
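A hedged sketch of the kind of validity check this implies: compile each generated kernel with `nvcc` and treat a non-zero exit code as invalid. This assumes `nvcc` is on PATH; `sm_80` targets the A100.

```python
import os
import subprocess
import tempfile

def cuda_compiles(source: str, arch: str = "sm_80") -> bool:
    """Return True if the generated CUDA source compiles cleanly."""
    with tempfile.NamedTemporaryFile("w", suffix=".cu", delete=False) as f:
        f.write(source)
        path = f.name
    try:
        proc = subprocess.run(
            ["nvcc", f"-arch={arch}", "-c", path, "-o", os.devnull],
            capture_output=True,
        )
        return proc.returncode == 0
    finally:
        os.remove(path)
```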
In retrieval-augmented generation, the retrieval component supplies relevant context, while the generative model integrates this data to create coherent and functional outputs. Additionally, incorporating automated testing and validation mechanisms during the generation process further enhances reliability, ensuring that the generated code meets quality benchmarks and integrates seamlessly into existing codebases.
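A minimal sketch of such a generate-validate-retry loop, assuming a hypothetical `generate()` hook into the model and plain subprocess-based test execution:

```python
import os
import subprocess
import tempfile

def generate(prompt: str, feedback: str = "") -> str:
    """Hypothetical hook: call your code LLM here, optionally passing
    the previous failure output back as extra context."""
    raise NotImplementedError

def passes_tests(code: str, tests: str, timeout: int = 10):
    """Run code + tests in a subprocess; return (passed, stderr)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n" + tests)
        path = f.name
    try:
        proc = subprocess.run(["python", path], capture_output=True,
                              text=True, timeout=timeout)
        return proc.returncode == 0, proc.stderr
    except subprocess.TimeoutExpired:
        return False, "timeout"
    finally:
        os.remove(path)

def generate_with_validation(prompt: str, tests: str, retries: int = 3):
    feedback = ""
    for _ in range(retries):
        code = generate(prompt, feedback)
        ok, stderr = passes_tests(code, tests)
        if ok:
            return code
        feedback = stderr  # feed the failure back into the next attempt
    return None  # no candidate passed within the retry budget
```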