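Multi-category Corpora: Multi-category corpora contain two or more types of data, which helps improve the generalization ability of LLMs. 1.2 Domain-specific Pre-training Corpora are pre-training corpora targeted at a specific domain. This type of corpus is typically used in the incremental pre-training stage of an LLM: if a model is to be applied to downstream tasks in a specific domain, it can be further incrementally pre-trained on that domain's corpus. Financial Dom...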
Take advantage of our high-quality monolingual datasets for LLMs and start achieving better results in your natural language processing tasks.
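Social-norm evaluation datasets assess LLMs along dimensions such as ethics, morality, bias, toxicity, and safety; an example is SafetyBench. Factuality: assesses the factuality of LLM outputs (the degree of hallucination); examples are FACTOR and HaluEval. Evaluation: the rise of LLMs has introduced a new evaluation paradigm in which many works use LLMs as evaluators; datasets in this category assess how reliable LLMs are as evaluators. Examples are FairEval and LLMEval2. Multitask: multi...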
The #1 voice data provider for LLMs. Access ethically sourced, pre-labeled voice & video datasets in hundreds of languages, trusted by the world's top brands.
Discover our datasets for artificial intelligence applications. Improve your projects using the largest data sets from Pangeanic.
* [Datasets for LLMs Internship](https://apply.workable.com/huggingface/j/4A6EA3243C/), building datasets to train the next generation of large language models, and the assorted tools. The following other internship positions are available: ...
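Awesome-LLMs-Datasets: summarizes existing representative text datasets for large language models (LLMs) along five dimensions: pre-training corpora, instruction fine-tuning datasets, preference datasets, evaluation datasets, and traditional natural language processing (NLP) datasets. (Updated regularly.) Repository: github.com/lmmlzn/Awesome-LLMs-Datasets. An accompanying research paper provides a comprehensive review of currently available dataset resources, including statistics from 444 datasets...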
I'm Sebastian: a machine learning & AI researcher, programmer, and author. As a Staff Research Engineer at Lightning AI, I focus on the intersection of AI research, software development, and large language models (LLMs).
Ever-larger datasets for AI training pose big challenges for data engineers and big risks for the models themselves. Credit: Marcus Buchwald From early-2000s chatbots to the latest GPT-4 model, generative AI continues to permeate the lives of workers both in and out of the tech industry. ...
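(2) Use real conversations between humans and LLMs as the instruction dataset. (3) Have multiple LLMs/agents converse with one another and use the resulting dialogues as the instruction dataset. Collection and rewriting of existing datasets (CI). Advantages: diversity and comprehensiveness, large scale, time savings. Disadvantages: quality and format standardization, data licensing. Combining the above methods: HG&CI, HG&MC, CI&MC, HG&CI&MC ...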
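The combined construction methods listed in the snippet above (HG&CI, HG&MC, CI&MC, HG&CI&MC) are simply every way of pairing or tripling the three base methods. A minimal Python sketch of that enumeration follows. The function name `combined_methods` is hypothetical and chosen only for illustration, and the abbreviations are copied from the snippet without expanding them.

```python
from itertools import combinations

# Base construction-method abbreviations, as they appear in the snippet.
BASE_METHODS = ["HG", "CI", "MC"]


def combined_methods(methods):
    """Return every combination of two or more methods, joined with '&'.

    Hypothetical helper for illustration; it is not part of any library.
    """
    combos = []
    for size in range(2, len(methods) + 1):
        for combo in combinations(methods, size):
            combos.append("&".join(combo))
    return combos


if __name__ == "__main__":
    for label in combined_methods(BASE_METHODS):
        print(label)
```

With three base methods this yields three pairs plus one triple, which matches the four combinations named in the snippet.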