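from datasets import load_dataset
c4_subset = load_dataset('allenai/c4', data_files='en/c4-train.0000*-of-01024.json.gz')
Use the split parameter to specify a custom split (see the next section).
1.2 Local and remote files
Datasets stored locally or remotely as csv, json, txt, or parquet files can all be loaded:
1.2.1 CSV
# Multiple CSV files: dataset ...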
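To make the `data_files` pattern above concrete, the following sketch lists which shard names the glob `en/c4-train.0000*-of-01024.json.gz` would select. It uses only the standard-library `fnmatch` module as a stand-in for the pattern resolution done by `datasets`, which may differ in details. The helper names `shard_name` and `matching_shards` are hypothetical, and the shard naming scheme (five-digit, zero-based index) is assumed from the pattern itself.

```python
from fnmatch import fnmatch

# Pattern taken from the load_dataset call above.
PATTERN = "en/c4-train.0000*-of-01024.json.gz"


def shard_name(index: int, total: int = 1024) -> str:
    """Build a shard file name (assumed scheme: zero-based, five-digit index)."""
    return f"en/c4-train.{index:05d}-of-{total:05d}.json.gz"


def matching_shards(pattern: str = PATTERN, total: int = 1024) -> list[str]:
    """Return the shard names that the glob pattern selects."""
    names = (shard_name(i, total) for i in range(total))
    return [name for name in names if fnmatch(name, pattern)]


matched = matching_shards()
print(len(matched))             # number of shards selected by the pattern
print(matched[0], matched[-1])  # first and last selected shard
```

Under these assumptions the pattern selects only the shards whose index starts with `0000`, that is, a small slice of the 1024 training shards rather than the full split.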
Update the load_dataset format
1 parent 2a8b8c4, commit 55c7c58
7 files changed, +29 -16 lines changed:
README.md
demo/demo_pt.py
demo/demo_sft.py
mini_data/pt/accommodation_catering_hotel/english/high/rank_00726.parquet
mini_qwen_pt.py
mini_qwen_sft.py
utils/save_mini_data.py
README.md...
_dataset_path == 'c4':
    self._dataset_name = 'realnewslike'
    data = load_dataset(self._dataset_path, self._dataset_name, split=self._split)
else:
    data = load_dataset(os.path.join(self._dataset_path, 'wikitext'), self._dataset_name, split=self._split, cache_dir="/home/chen...
Downloading and preparing dataset json/c4-en-html-with-metadata to C:\Users\...\.cache\huggingface\datasets\json\c4-en-html-with-metadata-4635c2fd9249f62d\0.0.0\c90812beea906fcffe0d5e3bb9eba909a80a998b5f88e9f8acbd320aa91acfde... 100%|█████████████████████...