Describe the bug

When I load a dataset from a number of Arrow files, as in:

```python
random_dataset = load_dataset(
    "arrow",
    data_files={split: shard_filepaths},
    streaming=True,
    split=split,
)
```

I'm able to get fast iter…
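A streamed (iterable) dataset yields examples lazily, so the first few records can be read without touching every shard. A minimal stdlib sketch of that access pattern, using made-up JSON-Lines shards as a stand-in for the Arrow files (no `datasets` install required):

```python
import itertools
import json
import tempfile
from pathlib import Path

# Write two small JSON-Lines "shards" to stand in for the arrow shards.
tmp = Path(tempfile.mkdtemp())
shard_filepaths = []
for shard_idx in range(2):
    path = tmp / f"shard-{shard_idx}.jsonl"
    with open(path, "w") as f:
        for i in range(3):
            f.write(json.dumps({"shard": shard_idx, "i": i}) + "\n")
    shard_filepaths.append(path)

opened = []  # track which shards are actually touched

def stream_examples(paths):
    """Yield records one at a time, opening each shard only when reached."""
    for path in paths:
        opened.append(path.name)
        with open(path) as f:
            for line in f:
                yield json.loads(line)

# Take the first 2 examples: only the first shard is ever opened.
first_two = list(itertools.islice(stream_examples(shard_filepaths), 2))
print(first_two)  # [{'shard': 0, 'i': 0}, {'shard': 0, 'i': 1}]
print(opened)     # ['shard-0.jsonl']
```

This is the same laziness `streaming=True` gives: iteration cost scales with how many examples you consume, not with the total dataset size.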
Example code follows: just set streaming=True. The dataset loaded this way is an iterable object, and the downstream processing is the same as introduced earlier. Since we don't need data at that scale ourselves, we won't go into detail here; anyone who does can consult the tutorial.

```python
pubmed_dataset_streamed = load_dataset(
    "json", data_files=data_files, split="train", streaming=True
)
```
...
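One place where the iterable form differs from a regular dataset is shuffling: a stream cannot be fully permuted, so `IterableDataset.shuffle(buffer_size=...)` only approximates a shuffle with a fixed-size buffer. A stdlib sketch of that buffering idea (function name and parameters are my own):

```python
import random

def buffered_shuffle(iterable, buffer_size, seed=0):
    """Approximate shuffle of a stream with a fixed-size buffer:
    fill the buffer, then for each new item emit a randomly chosen
    buffered item and replace it. Memory stays O(buffer_size)."""
    rng = random.Random(seed)
    buffer = []
    for item in iterable:
        if len(buffer) < buffer_size:
            buffer.append(item)
        else:
            idx = rng.randrange(buffer_size)
            yield buffer[idx]
            buffer[idx] = item
    rng.shuffle(buffer)  # flush whatever is left, in random order
    yield from buffer

shuffled = list(buffered_shuffle(range(10), buffer_size=4, seed=42))
print(shuffled)  # a permutation of 0..9, only locally shuffled
```

The larger the buffer, the closer the result is to a true shuffle, at the cost of memory.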
I would imagine that something like this (streaming True or False):

```python
d = load_dataset("new_dataset.py", storage_options=storage_options, split="train")
```

would work with:

```python
# new_dataset.py
...
_URL = "abfs://container/image_folder"
archive_path = dl_manager.download(_URL)
split_metadata_paths...
```
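For context on the pieces involved: `storage_options` is forwarded to the fsspec filesystem selected by the URL scheme (`abfs://` resolves to the adlfs backend), and the rest of the URL names the container and path. A small sketch of how such a URL decomposes; the credential keys shown are typical adlfs options and the values are made up:

```python
from urllib.parse import urlsplit

# Hypothetical credentials: forwarded by datasets/fsspec to the
# filesystem that backs the URL scheme (adlfs for abfs://).
storage_options = {
    "account_name": "myaccount",  # made-up value
    "account_key": "mykey",       # made-up value
}

_URL = "abfs://container/image_folder"
parts = urlsplit(_URL)
print(parts.scheme)  # 'abfs'  -> selects the fsspec filesystem
print(parts.netloc)  # 'container'
print(parts.path)    # '/image_folder'
```

In other words, nothing in the script itself authenticates; the scheme picks the backend and `storage_options` supplies its credentials.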