First, let's look at the Ruozhiba dataset used in the paper. The dataset is open source on Hugging Face as m-a-p/COIG-CQIA: from datasets import load_dataset; dataset = load_dataset("m-a-p/COIG-CQIA", 'r…
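Because the subset name in the snippet above is cut off, a small sketch like the following may help when exploring the dataset. It lists the published subset names rather than guessing one, and renders a single record as text. The helper names `list_subsets` and `format_example` are hypothetical, and the field names `instruction`, `input` and `output` are an assumption about the record schema, not something the excerpt confirms.

```python
from typing import Dict, List


def list_subsets(repo_id: str = "m-a-p/COIG-CQIA") -> List[str]:
    """Return the configuration (subset) names published for the dataset.

    Needs the `datasets` package and network access, so the import is
    deferred until the function is actually called.
    """
    from datasets import get_dataset_config_names

    return get_dataset_config_names(repo_id)


def format_example(record: Dict[str, str]) -> str:
    """Render one record as a plain prompt/response string.

    The keys "instruction", "input" and "output" are assumed field names;
    adjust them to whatever the chosen subset actually contains.
    """
    instruction = record.get("instruction", "").strip()
    extra_input = record.get("input", "").strip()
    output = record.get("output", "").strip()
    # Append the optional input to the instruction only when it is non-empty.
    prompt = instruction if not extra_input else f"{instruction}\n{extra_input}"
    return f"### Instruction:\n{prompt}\n\n### Response:\n{output}"
```

Calling `list_subsets()` returns the available subset names, any of which can then be passed as the second argument to `load_dataset`. The prompt layout in `format_example` is only one common convention for instruction data, chosen here for illustration.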
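This paper, "COIG-CQIA: Quality is All You Need for Chinese Instruction Fine-tuning", mainly introduces COIG-CQIA, a high-quality Chinese instruction-tuning dataset. It specifically notes that the researchers also collected post titles from Baidu's Ruozhiba forum, and used this data...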
Our aim is to build a diverse, wide-ranging instruction-tuning dataset to better align model behavior with human interactions. To this end, we collect a high-quality human-written corpus from various sources on the Chinese Internet, including Q&A communities, Wikis, examinations, and existing NLP...
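These questions all come from Ruozhiba, an overlooked treasure trove of training text. High-quality corpus: an interesting paper appeared in the last couple of days, "COIG-CQIA: Quality is All You Need for Chinese Instruction Fine-tuning", whose gist is roughly "for fine-tuning, quality is what counts". We all know that in large-model training, our Chinese datasets are, more or less, somewhat problematic: they are either derived from English data, or...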