We evaluate our code-generation approach on two benchmarks: (i) RoboCodeGen, a robotics-themed benchmark we introduce, and (ii) HumanEval [1], which consists of standard code-generation problems. RoboCodeGen is a new benchmark of 37 function-generation problems with several key differences from pr...
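For context, a HumanEval-style item pairs a function signature and docstring with hidden unit tests, and a completion is counted as correct only if it executes successfully against those tests. A minimal sketch of that execution-based check follows; the prompt, completion, and tests are placeholders, not actual items from HumanEval or RoboCodeGen.

```python
# Minimal sketch of execution-based checking for a HumanEval-style problem.
# Prompt, candidate completion, and tests are illustrative placeholders.

prompt = '''
def running_mean(xs):
    """Return the running mean of a list of numbers."""
'''

candidate_completion = '''
    means, total = [], 0.0
    for i, x in enumerate(xs, start=1):
        total += x
        means.append(total / i)
    return means
'''

test_program = '''
assert running_mean([2, 4, 6]) == [2.0, 3.0, 4.0]
assert running_mean([]) == []
'''

# The problem counts as solved only if the assembled program runs
# without raising (i.e. every assert passes).
namespace = {}
exec(prompt + candidate_completion + test_program, namespace)
print("candidate passed all tests")
```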
Code generation models based on the pre-training and fine-tuning paradigm have been pursued increasingly by both academia and industry, resulting in well-known industrial models such as Codex, CodeGen, and PanGu-Coder. To evaluate the effectiveness of these models, multiple existing benchmarks (e...
Multi-lingual Evaluation of Code Generation Models
pdf: https://openreview.net/pdf?id=Bo7eeXm6An8
## TL;DR
Previous execution-result-based datasets are almost exclusively in Python. This paper proposes a new benchmark for evaluating multilingual code generation, containing NL prompts, multiple programming languages, and evaluation data.
## Details
Code generation is generally evaluated in one of two ways: one...
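The snippet is cut off before naming the two evaluation styles; in the code-generation literature these are usually match-based scoring (compare the generated text against a reference solution) and execution-based checking (run the generated code against test cases). The sketch below contrasts the two under that assumption, with illustrative samples.

```python
# Contrast of the two common evaluation styles for generated code:
# match-based (textual agreement with a reference) vs. execution-based
# (does the code survive its unit tests). Samples are illustrative.

def exact_match(generated: str, reference: str) -> bool:
    """Match-based: score by textual agreement with a reference solution."""
    return generated.strip() == reference.strip()

def passes_tests(generated: str, tests: str) -> bool:
    """Execution-based: score by whether the code passes its unit tests."""
    try:
        scope = {}
        exec(generated + "\n" + tests, scope)
        return True
    except Exception:
        return False

reference = "def add(a, b):\n    return a + b"
generated = "def add(a, b):\n    return b + a"   # different text, same behavior
tests = "assert add(1, 2) == 3\nassert add(-1, 1) == 0"

print(exact_match(generated, reference))  # False: penalized despite being correct
print(passes_tests(generated, tests))     # True: gets execution-based credit
```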
EvoCodeBench is an evolutionary code generation benchmark aligned with real-world code repositories. Details of EvoCodeBench can be found in our paper "EvoCodeBench: An Evolving Code Generation Benchmark Aligned with Real-world Code Repositories". News: [Mar 29, 2024] We relea...
CodeFuseEval is a code generation benchmark that combines the multi-tasking scenarios of the CodeFuse model with the HumanEval-x and MBPP benchmarks. Repository: https://gitee.com/codefuse-ai/codefuse-evaluation.git (git@gitee.com:codefuse-ai/codefuse-evaluation.git) ...
This approach aligns more closely with the practices of human developers and provides a valuable benchmark for the ongoing development of code generation models. Implications: since its inception in mid-2021, the HumanEval benchmark has not only become immensely popular but has also emerged as a ...
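For reference, HumanEval results are conventionally reported with the unbiased pass@k estimator from the original HumanEval paper: given n sampled completions per problem, c of which pass the tests, pass@k estimates the chance that at least one of k drawn samples is correct. A small sketch of the numerically stable form:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: 1 - C(n-c, k) / C(n, k), i.e. the probability that
    at least one of k completions drawn from n samples (c correct) passes."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Illustrative numbers: 200 samples per problem, 37 of them correct.
print(round(pass_at_k(200, 37, 1), 3))    # 0.185
print(round(pass_at_k(200, 37, 10), 3))
```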
ClassEval is the first class-level code generation benchmark, described in the paper "ClassEval: A Manually-Crafted Benchmark for Evaluating LLMs on Class-level Code Generation". Please check out our ClassEval Leaderboard for the evaluation results of the most recent LLMs on class-level code generation...
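To make "class-level" concrete: a task of this kind typically gives the model a class skeleton with docstrings and asks it to complete interdependent methods that share state, rather than a single standalone function. The example below only illustrates that shape and is not an actual ClassEval item.

```python
# Illustrative shape of a class-level generation task (not a ClassEval item):
# the model must complete methods that depend on shared instance state.

class ShoppingCart:
    """A simple shopping cart keyed by item name."""

    def __init__(self):
        self.items = {}  # name -> (unit_price, quantity)

    def add_item(self, name: str, price: float, quantity: int = 1) -> None:
        """Add `quantity` units of `name` at `price` per unit."""
        _, existing = self.items.get(name, (price, 0))
        self.items[name] = (price, existing + quantity)

    def total(self) -> float:
        """Return the total cost of everything in the cart."""
        return sum(price * qty for price, qty in self.items.values())

cart = ShoppingCart()
cart.add_item("apple", 0.5, quantity=4)
cart.add_item("bread", 2.0)
print(cart.total())  # 4.0
```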
... DeepSeek Coder models are trained with a 16,000-token window size and an extra fill-in-the-blank task to enable project-level code completion and infilling. DeepSeek Coder achieves state-of-the-art performance on various code generation benchmarks compared to other open-source code ...
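A rough sketch of how such a fill-in-the-blank (fill-in-the-middle) prompt is usually assembled from the code before and after the gap. The sentinel strings below are placeholders, not DeepSeek Coder's actual special tokens; the real ones live in the model's tokenizer configuration.

```python
# Sketch of a fill-in-the-middle (FIM) prompt for an infilling-capable model.
# Sentinels are hypothetical placeholders, not DeepSeek Coder's real tokens.

FIM_PREFIX = "<fim_prefix>"   # hypothetical sentinel
FIM_SUFFIX = "<fim_suffix>"   # hypothetical sentinel
FIM_MIDDLE = "<fim_middle>"   # hypothetical sentinel

def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Ask the model to generate the code that belongs between prefix and suffix."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"

prefix = "def clamp(x, lo, hi):\n    "
suffix = "\n    return x\n"
print(build_fim_prompt(prefix, suffix))
# The model is expected to emit the missing middle, e.g. the bounds checks
# that turn this stub into a working clamp function.
```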
HumanEval-X is a benchmark for evaluating multilingual models, built by hand-writing the solutions in C++, Java, JavaScript, and Go. CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Evaluations on HumanEval-X. https://arxiv.org/pdf/2303.17568.pdf ...
Building on the self-collaboration framework, the virtual team formed by ChatGPT (GPT-3.5) can achieve significant improvements over a single LLM agent on multiple code-generation benchmarks. (4) In some practical scenarios, self-collaboration code generation demonstrates significant effectiveness on more complex code generation tasks (such as repository-level code generation); this...
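A minimal sketch of what such a self-collaboration loop can look like, with one LLM playing analyst, coder, and tester roles in turn. `ask_llm` is a placeholder for any chat-completion client, and the role prompts are illustrative, not the paper's exact instructions.

```python
# Sketch of a role-playing self-collaboration loop: analyst plans, coder
# implements, tester reviews, and the coder revises until the tester is
# satisfied or the round budget runs out. `ask_llm` must be wired to an
# actual chat-completion API before use.

def ask_llm(role: str, message: str) -> str:
    """Placeholder: send `message` to an LLM using a role-specific system prompt."""
    raise NotImplementedError("connect this to your chat-completion client")

def self_collaborate(requirement: str, max_rounds: int = 3) -> str:
    plan = ask_llm("analyst", f"Break this requirement into subtasks:\n{requirement}")
    code = ask_llm("coder", f"Implement this plan as Python code:\n{plan}")
    for _ in range(max_rounds):
        report = ask_llm("tester", f"Review and test this code; say ALL TESTS PASS if it is correct:\n{code}")
        if "ALL TESTS PASS" in report:
            break
        code = ask_llm("coder", f"Revise the code to address this report:\n{report}\n\nCode:\n{code}")
    return code
```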