Thank you for your understanding.

List of project contributors and organizations: ReactiveCJ

Citation / How do I cite us?

@misc{bright_xu_2019_3402023,
  author = {Bright Xu},
  title = {NLP Chinese Corpus: Large Scale Chinese Corpus for NLP},
  month = sep,
  year = 2019,
  doi = {10.5281/zenodo.3402023},
  ver...
Liang Xu, NLPCC2019: Large-Scale Chinese Datasets for NLP, http://github.com/brightmart/nlp_chinese_corpus

Please also email us the title of your paper, or a note on the work you have done with this project's datasets.

Reference
- An experiment on building a word-vector model from the Chinese Wikipedia corpus with Python (利用Python构建Wiki中文语料词向量模型试验)
- A tool for extracting plain text from Wikipedia dumps
- Open Chinese convert (...
Repository description: 大规模中文自然语言处理语料 / Large Scale Chinese Corpus for NLP
Gitee mirror: https://gitee.com/xxzhewly/nlp_chinese_corpus.git (git@gitee.com:xxzhewly/nlp_chinese_corpus.git)
In this paper, we introduce CLUECorpus2020, a large-scale Chinese corpus from the CLUE organization that can be used directly for self-supervised learning, such as pre-training a language model, or for language generation. It contains 100 GB of raw text with 35 billion Chinese characters, which...
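A raw corpus like this is typically consumed character by character before language-model pre-training. Below is a minimal sketch of that first step; the sample strings and the simple integer vocabulary are illustrative assumptions, not part of CLUECorpus2020 or any CLUE tooling.

```python
# Minimal sketch: character-level tokenization of raw Chinese text,
# a common first step before self-supervised (language-model) pre-training.
# The corpus strings and vocabulary scheme below are made-up illustrations.

def build_vocab(texts):
    """Map each distinct character to an integer id; 0 is reserved for <unk>."""
    chars = sorted({ch for text in texts for ch in text})
    return {ch: i + 1 for i, ch in enumerate(chars)}

def encode(text, vocab):
    """Encode a string as character ids; unknown characters map to 0 (<unk>)."""
    return [vocab.get(ch, 0) for ch in text]

corpus = ["自然语言处理", "大规模中文语料"]
vocab = build_vocab(corpus)
ids = encode("中文语料", vocab)
```

Chinese is usually tokenized at the character (or subword) level rather than by whitespace, which is why a per-character vocabulary is the natural baseline here.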
CORGI-PM (Zhang et al., 2023a) filters sentences that might carry gender bias out of a large-scale Chinese corpus, constructing a dataset for gender-bias detection, classification, and mitigation tasks. Most studies measure performance using ROC-AUC (Do, 2019; Park et al., 2018;...
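ROC-AUC can be understood as the probability that a randomly chosen positive example is scored above a randomly chosen negative one (with ties counting as half). A small self-contained sketch, with illustrative labels and scores rather than data from any of the cited studies:

```python
# Minimal sketch of ROC-AUC via pairwise comparison: the fraction of
# (positive, negative) pairs where the positive example gets the higher
# score, counting ties as 0.5. Labels and scores are illustrative.

def roc_auc(labels, scores):
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 0.75
```

This O(P·N) pairwise form is fine for small evaluation sets; production metrics libraries compute the same quantity from sorted ranks instead.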
Deep-learning-based natural language processing models have proven powerful, but they need large-scale datasets. Because of the significant gap between real-world tasks and existing Chinese corpora, in this paper we introduce a large-scale corpus of informal Chinese. This corpus contains around 37 million...
This paper presents the NLPR Chinese Language Model Toolkit (v1.0) for constructing and testing Chinese language models. The toolkit provides extra efficiency and functionality when dealing with large Chinese text corpora. We introduce the techniques of ...
In this technical report, we release the Chinese Pre-trained Language Model (CPM), built with generative pre-training on large-scale Chinese training data. To the best of our knowledge, CPM, with 2.6 billion parameters and 100 GB of Chinese training data, is the largest Chinese pre-trained language model...
Word similarity is a fundamental problem in natural language processing. Chinese word similarity was computed based on a large-scale corpus. First, a platform for computing word similarity was implemented. This platform makes it easy for researchers to combine various algorithms to obtain the word sim...
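One standard corpus-based similarity measure such a platform could plug in is cosine similarity between word vectors. A minimal sketch, where the tiny embedding table is a made-up illustration and not output from the paper's actual system:

```python
import math

# Minimal sketch: cosine similarity between word vectors, a standard
# corpus-based word-similarity measure. The 3-d embedding table below
# is a hypothetical illustration, not data from any real model.

def cosine(u, v):
    """Cosine of the angle between vectors u and v (1.0 = identical direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

embeddings = {          # hypothetical vectors for illustration
    "国王": [0.9, 0.1, 0.3],
    "王后": [0.85, 0.2, 0.35],
    "香蕉": [0.1, 0.9, 0.2],
}

sim = cosine(embeddings["国王"], embeddings["王后"])
```

In a real system the vectors would come from a model trained on the large-scale corpus (e.g. word2vec-style embeddings), and the platform would let researchers swap this function for other similarity measures.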