Feature request

LLMs usually limit text by tokens. It may be useful to split a large text into chunks according to the number of tokens rather than the number of characters. For example, if an LLM allows us to use 8000 tokens, and we want t...
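A minimal sketch of what token-based splitting could look like, assuming a Hugging Face tokenizer; the `split_by_tokens` helper, the `gpt2` checkpoint, and the 8000-token budget are illustrative, not an existing API:

```python
from transformers import AutoTokenizer

def split_by_tokens(text: str, tokenizer, max_tokens: int) -> list[str]:
    """Split `text` into chunks of at most `max_tokens` tokens each."""
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    return [
        tokenizer.decode(token_ids[i : i + max_tokens])
        for i in range(0, len(token_ids), max_tokens)
    ]

tokenizer = AutoTokenizer.from_pretrained("gpt2")
long_text = "some long document text " * 4000  # stand-in for a real document
chunks = split_by_tokens(long_text, tokenizer, max_tokens=8000)
print(f"{len(chunks)} chunks")
```

Cutting on raw token boundaries can split a word or sentence in the middle; a fuller implementation would back off to the nearest separator (newline, sentence end) while still measuring chunk size in tokens.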
For reference, transformers' `run_clm.py` example already regroups tokenized text into fixed-size token blocks this way:

```python
if not data_args.streaming:
    lm_datasets = tokenized_datasets.map(
        group_texts,
        batched=True,
        num_proc=data_args.preprocessing_num_workers,
        load_from_cache_file=not data_args.overwrite_cache,
        desc=f"Grouping texts in chunks of {block_size}",
    )
else:
    lm_datasets = tokenized_datasets.map(
        group_texts,
        batched=True,
    )

if training_args.do_train:
    if "train" not in tokenized_datasets:
        raise ValueError("--do_train requires a train dataset")
```
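The `group_texts` helper that snippet maps over follows this pattern (reproduced as a sketch from the same example script; `block_size` is assumed to be set earlier, e.g. to the model's maximum sequence length):

```python
from itertools import chain

block_size = 1024  # assumed set earlier in the script, e.g. from tokenizer.model_max_length

def group_texts(examples):
    # Concatenate every field (input_ids, attention_mask, ...) across the batch.
    concatenated = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the small remainder so every chunk has exactly block_size tokens.
    total_length = (total_length // block_size) * block_size
    # Split the concatenated token lists into chunks of block_size.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # Causal LM training uses the inputs themselves as labels.
    result["labels"] = result["input_ids"].copy()
    return result
```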