Paper title: Evaluating Large Language Models: A Comprehensive Survey. Paper link: arxiv.org/abs/2310.1973 … key advances and limitations in evaluation. In addition, previous surveys focused mainly on the alignment evaluation of LLMs. This survey broadens that scope and synthesizes findings on both the capability and the alignment evaluation of LLMs. Through this integrated perspective and expanded scope, the survey complements those earlier works, providing an LLM evaluation ...
Original link: [2401.01711] Evaluating Large Language Models in Semantic Parsing for Conversational Question Answering over Knowledge Graphs (arxiv.org)
The advent of large language models such as ChatGPT, Gemini, and others has underscored the importance of evaluating their diverse capabilities, ranging from natural language understanding to code generation. However, their performance on spatial tasks has not been comprehensively assessed. This study ...
This study explores the use of Large Language Models (LLMs), specifically GPT-4, in analysing classroom dialogue—a key task for teaching diagnosis and quality improvement. Traditional qualitative methods are both knowledge- and labour-intensive. This research investigates the potential of LLMs to ...
Large language models (LLMs) have performed remarkably well on benchmarks across a wide range of natural language processing tasks, including in the Western medical domain. However, professional evaluation benchmarks for LLMs in the traditional Chinese medicine (TCM) domain have yet to be developed, which ...
OpenAI Codex. Topics: Software & Engineering, Transformers, Compute Scaling, Language, Generative Models. Authors: Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé, Jared Kaplan, Harri Edwards, Yura Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf ...
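The Codex evaluation work listed above popularized the pass@k metric for functional correctness: sample n completions per problem, count how many pass the unit tests, and estimate the probability that at least one of k samples is correct. A minimal sketch of the unbiased estimator reported in that paper; the sample counts in the usage example are made up for illustration.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): probability that at
    least one of k samples, drawn without replacement from n generations
    of which c are correct, passes the unit tests.
    Equivalent to 1 - C(n-c, k) / C(n, k), computed in a stable way."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))

# Illustrative numbers only: 200 samples per problem, 37 of which pass.
print(pass_at_k(n=200, c=37, k=1))   # 0.185 (= 37/200)
print(pass_at_k(n=200, c=37, k=10))  # substantially higher than pass@1
```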
Meet BigCodeBench by BigCode: The New Gold Standard for Evaluating Large Language Models on Real-World Coding Tasks
Patients increasingly use large language models (LLMs) for health-related information, but their reliability and usefulness remain controversial. Continuous assessment is essential to evaluate their role in patient education. This study evaluates the performance of ChatGPT 3.5 and Gemini in answering ...
Current benchmarks for evaluating large language models (LLMs) in medicine focus primarily on question answering that requires domain knowledge and descriptive reasoning, rather than on computation and logic-based reasoning. While such qualitative capabilities are vital to medical diagnosis, in...
generation task, the instruction-tuned models demonstrate only marginal improvements compared to the base models. Third, we disclose the types of failure modes that exist in our evaluation results. All these results underscore the need for further advancements in self-invoking code generation tasks ...
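For context, a self-invoking code generation task asks a model to first solve a base problem and then solve a harder follow-up whose solution must call the base one. The sketch below shows a hypothetical problem pair of this shape; the function names and tests are invented for illustration and are not drawn from any specific benchmark.

```python
# Hypothetical self-invoking problem pair (names and tests are illustrative only).

# Base problem: the model is first asked to implement this function.
def word_counts(text: str) -> dict:
    """Return a mapping from each word to its number of occurrences."""
    counts: dict = {}
    for word in text.split():
        counts[word] = counts.get(word, 0) + 1
    return counts

# Self-invoking follow-up: the harder problem must reuse the base solution.
def most_frequent_word(text: str) -> str:
    """Return the most frequent word, reusing word_counts above."""
    counts = word_counts(text)
    return max(counts, key=counts.get)

# Unit tests of the kind an execution-based harness would run.
assert word_counts("a b a") == {"a": 2, "b": 1}
assert most_frequent_word("a b a") == "a"
```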