semchunk is a fast, lightweight and easy-to-use Python library for splitting text into semantically meaningful chunks. It has built-in support for tokenizers from OpenAI's tiktoken and Hugging Face's transformers and tokenizers libraries, in addition to supporting custom tokenizers and token cou...
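To make the idea of chunking against a custom token counter concrete, here is a minimal sketch of recursive splitting. It is not semchunk's implementation or API. The names `count_tokens`, `SEPARATORS` and `chunk` are hypothetical, and the whitespace word count stands in for a real tokenizer.

```python
def count_tokens(text):
    # Hypothetical token counter: one token per whitespace-separated word.
    return len(text.split())

# Assumed separator hierarchy, from most to least meaningful boundary.
SEPARATORS = ["\n\n", "\n", ". ", " "]

def chunk(text, chunk_size, token_counter=count_tokens, level=0):
    """Split text into pieces of at most chunk_size tokens where possible."""
    text = text.strip()
    if not text:
        return []
    # Small enough already, or no finer separator left to try.
    if token_counter(text) <= chunk_size or level >= len(SEPARATORS):
        return [text]
    sep = SEPARATORS[level]
    chunks, current = [], ""
    for piece in text.split(sep):
        if not piece.strip():
            continue
        candidate = current + sep + piece if current else piece
        if token_counter(candidate) <= chunk_size:
            current = candidate  # keep merging neighbours while they fit
            continue
        if current:
            chunks.append(current)
            current = ""
        if token_counter(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too large: retry with a finer separator.
            chunks.extend(chunk(piece, chunk_size, token_counter, level + 1))
    if current:
        chunks.append(current)
    return chunks
```

Passing a different `token_counter` callable (for example one backed by a real tokenizer) changes how chunk sizes are measured without touching the splitting logic. Note that this sketch drops the separator at each chunk boundary.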
The tokenizer is inspired by the approach in Stanford’s CoreNLP, i.e. write down a bunch of regular expressions and compile them into a fast DFA. We use re2c, a lightweight scanner generator which often produces C that’s as fast as a handwritten equivalent. Indeed, tokenization is...
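The "bunch of regular expressions" approach can be sketched in Python by combining several named patterns into one scanner. This is an illustration only: the token classes below are hypothetical and are not the project's actual rules, and Python's `re` module is a backtracking engine, so unlike re2c it does not compile the patterns into a DFA.

```python
import re

# Hypothetical token classes; order matters because earlier alternatives win.
TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:\.\d+)?"),           # integers and decimals
    ("WORD", r"[A-Za-z]+(?:'[A-Za-z]+)?"),  # words, including contractions
    ("PUNCT", r"[^\w\s]"),                  # any single punctuation character
    ("SPACE", r"\s+"),                      # whitespace, discarded below
]

# Join every rule into one alternation of named groups.
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(text):
    """Return (kind, text) pairs; characters matching no rule are skipped."""
    tokens = []
    for match in MASTER.finditer(text):
        kind = match.lastgroup  # name of the rule that matched
        if kind != "SPACE":
            tokens.append((kind, match.group()))
    return tokens
```

A scanner generator such as re2c takes a similar rule list but emits a state machine as C code ahead of time, which is where the speed described above comes from.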
Note that the following optional features can be enabled upon building Sonic: allocator-jemalloc, tokenizer-chinese and tokenizer-japanese (some might be already enabled by default). 👉 Install from Cargo: You can install Sonic directly with cargo install: cargo install sonic-server Ensure that yo...