Feature request: LLMs usually limit input by tokens, not characters. It may therefore be useful to split a large text into chunks according to the number of tokens rather than the number of characters. For example, if an LLM allows us to use 8000 tokens, and we want t...
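As a minimal sketch of the idea (not a proposed API): encode the text with a real tokenizer, slice the token sequence, and decode each slice back to text. Here tiktoken and the `cl100k_base` encoding are assumptions; any tokenizer with `encode`/`decode` would work the same way.

```python
import tiktoken

def split_by_tokens(text: str, max_tokens: int = 8000) -> list[str]:
    """Split `text` into pieces of at most `max_tokens` tokens each."""
    # Assumption: an OpenAI-style encoding; swap in the tokenizer
    # that matches the target model.
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    # Slice the token sequence, not the character string, so each
    # chunk is guaranteed to fit the model's token budget.
    return [
        enc.decode(tokens[i : i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]
```

The same slicing works with any tokenizer that exposes encode/decode, e.g. a Hugging Face tokenizer.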
```python
        load_from_cache_file=not data_args.overwrite_cache,
        desc=f"Grouping texts in chunks of {block_size}",
    )
else:
    lm_datasets = tokenized_datasets.map(
        group_texts,
        batched=True,
    )

if training_args.do_train:
    if "train" not in tokenized_datasets:
        raise ValueError("--do_train requires a ...
```
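For reference, the `group_texts` function this snippet maps over the tokenized dataset (as in the transformers `run_clm` example) concatenates all token sequences and slices them into `block_size`-token chunks. A sketch of that shape, with `block_size = 1024` assumed here since it is defined elsewhere in the script:

```python
block_size = 1024  # assumption: defined elsewhere in the script

def group_texts(examples):
    # Concatenate every tokenized field (input_ids, attention_mask, ...)
    # into one long list per key.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the tail shorter than block_size so every chunk is full.
    total_length = (total_length // block_size) * block_size
    # Slice each field into chunks of exactly block_size tokens.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For causal LM training, the labels are the inputs themselves.
    result["labels"] = result["input_ids"].copy()
    return result
```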
Let’s start by eliminating the most dreaded problem: having to load all the data into RAM. If the data comes from a file, it makes sense to load only portions of it and operate on those portions. Using the skiprows and nrows arguments of pandas’ read_csv, it ...
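A hedged sketch of that idea follows; the file name, chunk size, and `process()` callback are placeholders. Passing a `range` to skiprows keeps row 0 (the header) while skipping the data rows already read:

```python
import pandas as pd

def iter_csv_chunks(path, chunk_size=100_000):
    """Yield successive DataFrames of at most chunk_size rows."""
    start = 0
    while True:
        # Skip the data rows already read, but keep row 0 (the header).
        chunk = pd.read_csv(path, skiprows=range(1, start + 1), nrows=chunk_size)
        if chunk.empty:
            break
        yield chunk
        start += chunk_size

# Usage: process() is a hypothetical per-chunk handler.
for chunk in iter_csv_chunks("data.csv"):
    process(chunk)
```

Note that each call rescans the file from the top, so for purely sequential streaming pandas’ chunksize= argument (which returns an iterator) is usually cheaper; skiprows/nrows are most useful when you need a specific window of rows.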