semchunk is a fast, lightweight and easy-to-use Python library for splitting text into semantically meaningful chunks. It has built-in support for tokenizers from OpenAI's tiktoken and Hugging Face's transformer
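To make the idea of splitting text into semantically meaningful chunks concrete, here is a minimal sketch in plain Python. It is not semchunk's implementation or API: the names `count_tokens`, `SEPARATORS` and `chunk` are hypothetical, and whitespace-delimited words stand in for real tokenizer tokens. The sketch splits on the coarsest separator first (paragraphs, then lines, sentences, words) and only descends to a finer one when a piece is still over the limit.

```python
# Hypothetical illustration only; this is NOT semchunk's code or API.

def count_tokens(text):
    # Assumption: whitespace-delimited words stand in for real tokens.
    return len(text.split())

# Coarse-to-fine separators: paragraph, line, sentence, word.
SEPARATORS = ["\n\n", "\n", ". ", " "]

def chunk(text, chunk_size, token_counter=count_tokens, level=0):
    """Split text into pieces of at most chunk_size tokens where possible."""
    text = text.strip()
    if not text:
        return []
    # Small enough already, or no finer separator left to try.
    if token_counter(text) <= chunk_size or level >= len(SEPARATORS):
        return [text]
    sep = SEPARATORS[level]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if token_counter(candidate) <= chunk_size:
            # Greedily merge neighbouring parts while they still fit.
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if token_counter(part) <= chunk_size:
            current = part
        else:
            # The part alone is too large: retry with a finer separator.
            chunks.extend(chunk(part, chunk_size, token_counter, level + 1))
    if current:
        chunks.append(current)
    return [c.strip() for c in chunks if c.strip()]

sample = "One two three. Four five six.\n\nSeven eight nine ten eleven twelve."
for piece in chunk(sample, chunk_size=3):
    print(repr(piece))
```

Passing a different `token_counter` (for example one backed by a real tokenizer) changes how sizes are measured without touching the splitting logic. Note that this sketch drops the separator it splits on, so the period after "three" is lost when a sentence boundary is used.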
The tokenizer is inspired by the approach in Stanford’s CoreNLP, i.e. write down a bunch of regular expressions and compile them into a fast DFA. We use re2c, a light-weight scanner generator which often produces C that’s as fast as a handwritten equivalent. Indeed, tokenization is quite ...
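The "bunch of regular expressions" approach can be sketched in Python as follows. This is only an illustration of combining per-token rules into a single scanner: the rule names and patterns are hypothetical, and Python's `re` module is a backtracking engine rather than a compiled DFA, so it does not reproduce the performance of re2c-generated C.

```python
import re

# Hypothetical token rules (not taken from the original tokenizer).
# Order matters: earlier alternatives win when several could match.
TOKEN_RULES = [
    ("URL", r"https?://\S+"),
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("WORD", r"[A-Za-z]+(?:'[A-Za-z]+)?"),
    ("PUNCT", r"[^\w\s]"),
]

# Combine all rules into one pattern with a named group per rule.
SCANNER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_RULES)
)

def tokenize(text):
    """Return (rule_name, token_text) pairs; unmatched characters are skipped."""
    return [(m.lastgroup, m.group()) for m in SCANNER.finditer(text)]

for kind, token in tokenize("See http://x.io now, pay 3.50!"):
    print(kind, token)
```

A scanner generator such as re2c takes a similar rule list but turns it into a state machine ahead of time, which is where the speed advantage described above comes from.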
Note that the following optional features can be enabled upon building Sonic: allocator-jemalloc, tokenizer-chinese and tokenizer-japanese (some might be already enabled by default). 👉 Install from Cargo: You can install Sonic directly with cargo install: cargo install sonic-server Ensure that yo...