Open-Domain Text Evaluation via Meta Distribution Modeling[J]. arXiv preprint arXiv:2306.11879, 2023. [25] Mesgar M, Bücker S, Gurevych I. Dialogue coherence assessment without explicit dialogue act labels[J]. arXiv preprint arXiv:1908.08486, 2019. [26] Vakulenko S, de Rijke M, Cochez M, et...
Hallucinations have always been a major issue with LLMs. GPT-4.5 improves significantly here, with a reduced hallucination rate. In tests like SimpleQA, it outperformed GPT-4, making it a more reliable tool for research, professional use, and everyday queries. Perf...
we elaborate on the Baichuan 2 architecture and scaling results. Finally, we describe the distributed training system.

2.1 Pre-training Data

Data sourcing: During data
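Among work on decomposing the reasoning process through prompting, MURMUR[23] finds that, for data-to-text tasks, directly prompting an LLM to reason tends to cause hallucinations, while chain-of-thought (CoT) prompting lacks explicit conditioning between reasoning steps, which hurts correctness; linearizing the data in different orders also tends to produce high variance. MURMUR therefore proposes first using beam search at each step, constrained by predefined grammar rules, to select the range of models likely to be correct, then choosing the best model according to a scoring model, and...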
Holistic evaluation of language models[J]. arXiv preprint arXiv:2211.09110, 2022. [6] Lees A, Tran V Q, Tay Y, et al. A new generation of perspective api: Efficient multilingual character-level transformers[C]//Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data ...
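characteristics that differ from those of deep learning models and even small pre-trained models, with familiar examples such as Few/Zero-Shot Learning, In-Context Learning...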
InvestLM: A Large Language Model for Investment using Financial Domain Instruction Tuning
BBT-Fin: Comprehensive Construction of Chinese Financial Domain Pre-trained Language Model, Corpus and Benchmark
PIXIU: A Large Language Model, Instruction Data and Evaluation Benchmark for Finance ...
SELFCHECKGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models
SELF-CONTRADICTORY HALLUCINATIONS OF LLMS: EVALUATION, DETECTION AND MITIGATION
To measure the inconsistency among multiple responses randomly sampled from the model, Self-Check tried five methods, including BERT-based similarity; the two that worked best were traditional NLI...
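The sampling-consistency idea can be illustrated with a minimal sketch. This is not the SelfCheckGPT implementation: the function names (`inconsistency_scores`, `jaccard`, `split_sentences`) are hypothetical, and plain token overlap stands in for the BERT-based or NLI scorers the text mentions. The assumption is that a sentence supported by most sampled responses is less likely to be hallucinated than one that no sample echoes.

```python
import re


def tokens(text):
    """Lowercase word tokens; a crude stand-in for a learned similarity model."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a, b):
    """Token-set overlap in [0, 1]; assumed here in place of BERT similarity or NLI."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def split_sentences(text):
    """Naive splitter on sentence-final punctuation followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def inconsistency_scores(response, samples):
    """Score each sentence of `response` against independently sampled responses.

    For every sample, take the best similarity between the sentence and any
    sentence in that sample, average over samples, and return 1 - average.
    Higher scores mean the samples give the sentence less support.
    """
    if not samples:
        raise ValueError("at least one sampled response is required")
    scores = []
    for sent in split_sentences(response):
        best_per_sample = [
            max((jaccard(sent, s) for s in split_sentences(sample)), default=0.0)
            for sample in samples
        ]
        scores.append(1.0 - sum(best_per_sample) / len(best_per_sample))
    return scores


response = "Paris is the capital of France. It has 90 million residents."
samples = ["Paris is the capital of France.", "The capital of France is Paris."]
scores = inconsistency_scores(response, samples)
```

In this toy input the first sentence is fully supported by both samples and the second by neither, so the second receives the higher inconsistency score. Replacing `jaccard` with a stronger scorer changes only that one function.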
to keep track of advancements in the field. Model providers also need to check for biases to ensure the quality of the starting dataset and the correct behavior of their model. Gathering evaluation data is vital for model providers. Furthermore...
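The related paper, "Benchmark evaluation of DeepSeek large language models in clinical decision-making", was published in Nature Medicine on April 23, 2025. Using 125 patient cases with sufficient statistical power, covering a broad range of common and rare diseases, the researchers found that the DeepSeek models performed on par with medicine-specific LLMs, and in some cases better.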