When tokenizing words with RobertaTokenizer, I found that the word acquire is split into two subword pieces, yet RobertaForMaskedLM can still predict the whole word acquire. The code below shows acquire being tokenized into 'ac' and 'quire':
from transformers import AutoTokenizer, RobertaForMaskedLM
import torch
tokenizer = AutoTokenizer.from_pretrained("./...
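One possible explanation, offered here as an assumption rather than something the post confirms: RoBERTa's byte-level BPE vocabulary stores a leading space as part of the token (displayed as Ġ). A word that follows a space in a sentence can therefore be a single vocabulary entry, while the same word with no preceding space is split into pieces. The toy sketch below illustrates the effect with a hypothetical three-entry vocabulary and a simple greedy longest-match splitter. `TOY_VOCAB` and `toy_tokenize` are made-up names, and the splitter is not RoBERTa's actual BPE merge procedure.

```python
# Toy illustration only: a hypothetical vocabulary and a greedy longest-match
# splitter. This is NOT RoBERTa's real byte-level BPE algorithm or vocabulary.
TOY_VOCAB = {"Ġacquire", "ac", "quire"}  # "Ġ" stands for a leading space


def toy_tokenize(text, vocab=TOY_VOCAB):
    """Split text into the longest matching vocabulary entries, left to right."""
    marked = text.replace(" ", "Ġ")  # make the space visible, as RoBERTa does
    tokens = []
    i = 0
    while i < len(marked):
        # Try the longest candidate first, then shrink until one matches.
        for j in range(len(marked), i, -1):
            if marked[i:j] in vocab:
                tokens.append(marked[i:j])
                i = j
                break
        else:
            raise ValueError(f"no vocabulary entry matches at position {i}")
    return tokens


print(toy_tokenize("acquire"))   # ['ac', 'quire']
print(toy_tokenize(" acquire"))  # ['Ġacquire']
```

To check whether this applies to a real checkpoint, compare the tokenizer's output for the word with and without a leading space. The masked position in a sentence usually follows a space, so the model would be predicting the space-prefixed entry.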
MicroTokenizer: A lightweight, full-featured Chinese tokenizer designed for educational and research purposes, intended to help students understand how tokenizers work. Provides a practical, hands-on approach to understanding NLP concepts, featuring multiple tokeni