Figure 1 shows the longest Chinese tokens in the GPT-4o vocabulary, and Figure 2 shows the two-character Chinese tokens. Figure 3 shows GPT-4o treating "给主人留下些什么吧" as a single token and interpreting it as a compliment. Figure 4 shows the comparatively normal GPT-4 vocabulary (cl100k_base): although that tokenizer is not very friendly to Chinese and Chinese text consumes more tokens, at least it does not contain many strange tokens.
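As a minimal sketch, the segmentation described above can be inspected with tiktoken; pairing cl100k_base with o200k_base (GPT-4o's encoding) is an assumption about which vocabularies the figures refer to.

    import tiktoken

    text = "给主人留下些什么吧"

    for name in ("cl100k_base", "o200k_base"):  # GPT-4 vs. GPT-4o vocabularies
        enc = tiktoken.get_encoding(name)
        ids = enc.encode(text)
        # Decode each token id back to raw bytes to see how the string was split.
        pieces = [enc.decode_single_token_bytes(i) for i in ids]
        print(name, len(ids), pieces)

If the claim in Figure 3 holds, o200k_base should return a single token id for the whole string, while cl100k_base splits it into many byte-level pieces.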
Changing

    tokenizer = tiktoken.get_encoding("cl100k_base" if model_name == "gpt-3.5-turbo" else "p50k_base")

to

    tokenizer = tiktoken.get_encoding("p50k_base")

makes everything work as expected.

Code snippets:

    import tiktoken
    from langchain import OpenAI, PromptTemplate

    full_text = "The content of...
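A related note, offered as a sketch: instead of branching on model_name by hand, tiktoken's encoding_for_model resolves the encoding from the model name directly (the model names below are illustrative).

    import tiktoken

    # Let tiktoken map model names to their encodings rather than hard-coding the branch.
    for model_name in ("gpt-3.5-turbo", "text-davinci-003"):
        enc = tiktoken.encoding_for_model(model_name)
        print(model_name, "->", enc.name)  # cl100k_base, p50k_base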
After searching for quite a while, there seems to be no JavaScript implementation of the cl100k_base tokenizer. As a simple interim solution...
| Method                                 | Text                | Mean       | Ratio | Gen0   | Gen1   | Allocated | Alloc Ratio |
|----------------------------------------|---------------------|------------|-------|--------|--------|-----------|-------------|
| MicrosoftMLTokenizerV1_0_0_CountTokens | King(...)edy. [275] | 3,871.2 ns | 0.65  | 0.0153 | -      | 96 B      | 0.18        |
| TokenizerLibV1_3_3_CountTokens         | King(...)edy. [275] | 7,465.8 ns | 1.25  | 3.0823 | 0.1373 | 19344 B   | 37.20       |
| Tiktoken_CountTokens                   | King(...)edy. [275] | 2,744.5 ns | 0.46  | 0.3128 | -      | 1976 B    | 3.80        |
...
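The table above compares .NET token counters. For orientation only, a minimal analogous timing in Python with the reference tiktoken package could look like the sketch below; the sample text and iteration count are arbitrary, and the numbers are not comparable to the .NET results.

    import timeit
    import tiktoken

    # Arbitrary sample text, not the 275-character string used in the table.
    text = "The quick brown fox jumps over the lazy dog. " * 6

    enc = tiktoken.get_encoding("cl100k_base")

    # Average the cost of one encode-and-count call over many iterations.
    n = 10_000
    seconds = timeit.timeit(lambda: len(enc.encode(text)), number=n)
    print(f"{seconds / n * 1e9:.1f} ns per call")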