Reminder: I have read the README and searched the existing issues.

System Info

Reproduction

Using 2048*2048 images, roughly 30k image-text pairs in total, in a sharegpt-format dataset. With preprocessing_num_workers=256 (or 128, 64, etc.), the run always stalls at "Running tokenizer on dataset" and stays stuck for a long time...
@SaulLu when I use the wikitext-103 dataset, tokenization hangs at "Running tokenizer on dataset" and shows no progress. This was not always an issue, but as of today it has become one. It will either hang at the very beginning of tokenizing or at the very end. Any idea why this would be hanging?
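For anyone debugging this: the "Running tokenizer on dataset" message is the `desc` of a `datasets.Dataset.map(...)` call that runs the tokenizer across multiple worker processes. Below is a minimal sketch of the kind of call that stalls. It is my own reconstruction, not the exact script either reporter ran; the model name, dataset config, column name, and `num_proc` value are placeholders.

```python
# Minimal sketch (not the actual training script) of the call that prints the
# "Running tokenizer on dataset" progress bar.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model

def tokenize(batch):
    # Tokenize one batch of raw text rows.
    return tokenizer(batch["text"])

ds = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")

# num_proc spawns that many worker processes (the role preprocessing_num_workers
# plays in the first report). If any worker blocks, the progress bar appears to
# freeze at the very start or near the end, matching the symptoms above.
tokenized = ds.map(
    tokenize,
    batched=True,
    remove_columns=["text"],
    num_proc=64,  # placeholder; the reports used 256/128/64
    desc="Running tokenizer on dataset",
)
```

In a sketch like this, lowering `num_proc` (or dropping it entirely to run single-process) is a quick way to check whether the hang is tied to the multiprocessing workers rather than to the tokenizer itself.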