SWE-bench Lite is a highly challenging benchmark proposed by Princeton University for evaluating LLMs on resolving real GitHub issues, and it has recently drawn wide attention from industry, academia, and startups. Doubao MarsCode Agent recently ranked first on the SWE-bench Lite leaderboard. 01 Multi-Agent Collaboration Framework: In day-to-day development work, developers frequently run into problems such as: running te...
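For context on what "resolving a real GitHub issue" means in benchmark terms, below is a minimal, hypothetical sketch of how a single SWE-bench-style instance is typically judged: apply the model-generated patch at the pinned commit, then re-run the issue's previously failing tests. The field names `base_commit` and `FAIL_TO_PASS` follow the public dataset schema; the helper function and the bare `pytest` call are simplifications of the official, containerized harness.

```python
# Simplified, hypothetical check for one SWE-bench-style instance.
# The real harness runs each instance in an isolated container with
# per-repo install scripts; this only illustrates the pass/fail logic.
import subprocess

def evaluate_instance(repo_dir: str, instance: dict, model_patch: str) -> bool:
    # Check out the commit the issue was filed against.
    subprocess.run(["git", "checkout", instance["base_commit"]],
                   cwd=repo_dir, check=True)

    # Try to apply the agent's patch; a patch that does not apply counts as unresolved.
    apply = subprocess.run(["git", "apply", "-"], input=model_patch,
                           text=True, cwd=repo_dir)
    if apply.returncode != 0:
        return False

    # Resolved only if the issue's previously failing tests now pass.
    tests = subprocess.run(["python", "-m", "pytest", *instance["FAIL_TO_PASS"]],
                           cwd=repo_dir)
    return tests.returncode == 0
```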
AgentBench is a comprehensive benchmark designed to evaluate Large Language Models (LLMs) as agents in interactive environments. As LLMs become increasingly capable and autonomous, they have expanded beyond traditional natural language processing tasks to tackle real-world, practical missions. Here are the ke...
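Because several of the snippets here describe "LLM-as-agent" benchmarks, a generic sketch of the evaluation loop they share may help: the model observes, acts, and the environment scores the trajectory. The class and method names below are illustrative only and are not taken from the AgentBench codebase.

```python
# Generic agent-environment rollout of the kind such benchmarks score.
# These interfaces are assumptions for illustration, not a real API.
from typing import Protocol

class Environment(Protocol):
    def reset(self) -> str: ...
    def step(self, action: str) -> tuple[str, float, bool]: ...  # observation, reward, done

class Agent(Protocol):
    def act(self, observation: str) -> str: ...

def run_episode(agent: Agent, env: Environment, max_turns: int = 30) -> float:
    """Roll out one interactive task and return the accumulated score."""
    observation = env.reset()
    score = 0.0
    for _ in range(max_turns):
        action = agent.act(observation)            # e.g. one LLM completion
        observation, reward, done = env.step(action)
        score += reward
        if done:
            break
    return score
```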
We introduce DA-Code, a code generation benchmark specifically designed to assess LLMs on agent-based data science tasks. This benchmark features three core elements: First, the tasks within DA-Code are inherently challenging, setting them apart from traditional code generation tasks and demanding ...
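To make "agent-based data science task" concrete, here is a purely hypothetical example of how such a task could be specified and graded: the agent receives data files plus an instruction, produces an artifact by writing and running its own code, and a checker compares that artifact against a reference. The layout and checker below are our own illustration, not DA-Code's actual format.

```python
# Hypothetical task specification and grader for a data-science agent task.
import pandas as pd

task = {
    "instruction": "Compute mean revenue per region and save it to result.csv",
    "inputs": ["sales.csv"],          # files placed in the agent's workspace
    "expected_output": "result.csv",  # artifact the grader inspects afterwards
}

def grade(workspace: str, reference: pd.DataFrame) -> bool:
    """Compare the artifact the agent produced against the reference answer."""
    try:
        produced = pd.read_csv(f"{workspace}/{task['expected_output']}")
    except FileNotFoundError:
        return False  # the agent never produced the required file
    # Order-insensitive comparison of the aggregated table.
    produced = produced.sort_values("region").reset_index(drop=True)
    reference = reference.sort_values("region").reset_index(drop=True)
    return produced.equals(reference)
```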
open-compass/DevBench — The DevBench dataset consists of 22 curated, hand-picked code repositories covering four programming languages (Python, C/C++, Java, JavaScript) and multiple domains such as machine learning, databases, web services, and command-line tools. The dataset also provides a comprehensive, automated evaluation suite for the various tasks involved in software development. Dev...
[EMNLP 2024] DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models - yiyihum/da-code
Hursin, M., Jevremovic, T., 2005. AGENT code: neutron transport benchmark example and extension to 3D lattice geometry. Nuclear Technology & Radiation Protection 20 (2), 10-16.
@mihaela-bornea are you looking for the exact OpenDevin version running that CodeAct version? Or would just running the latest 0.6 work for you? SWE-bench is a little involved to run. No, this issue can be closed; the question was actually about executing CodeAct as a standalone featur...
This paper summarizes the new features recently added to the AGENT code: full 3D capability based on coupling 2D MOC with 1D FDM, and full 2D core capability based on coupling single-assembly solutions. To improve the accuracy of the 3D solution we plan to implement a higher-order 1D axial solution li...
Jin10 Data, April 9: At 4 a.m. this morning, the well-known large-model training platform Together AI and the agent platform Agentica jointly open-sourced the new model DeepCoder-14B-Preview. The model has only 14 billion parameters, yet it scores 60.6% on the well-known code benchmark LiveCodeBench, above OpenAI's o1 model (59.5%) and slightly below o3-mini (60.9%). Its results on Codeforces and AIME 2024 are equally strong, nearly matching o1 and o3-...
Augment Code launches AI technology that outperforms GitHub Copilot by 70% through real-time context understanding of massive codebases, securing $270M funding and achieving the highest score on SWE-bench verified.