Hi all,
Just wondering if there is a way to download only one or two of the available parquet files.
This answer suggests using streaming, but I'm wondering whether there is a different way to do this too. For example, one approach that looks promising but isn't working for me is:
import datasets
import config


if __name__ == "__main__":
    hyper_parameters = config.DataConfig()
    dataset = datasets.load_dataset(
        "Multimodal-Fatima/COCO_captions_train",
        cache_dir=config.IMAGE_DOWNLOAD_PATH,
        data_files={"train": "data/train-00000-of-00038-757e7d149500e41c.parquet"},
    )
    print(len(dataset["train"]))
which gives the error:
Generating train split: 3%|██▊ | 2982/113287 [00:00<00:14, 7393.63 examples/s]
Traceback (most recent call last):
  File "/Users/sachinthakaabeywardana/personal_work/tiny_captions/src/download.py", line 7, in <module>
    dataset = datasets.load_dataset(
              ^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/datasets/load.py", line 2582, in load_dataset
    builder_instance.download_and_prepare(
  File "/opt/homebrew/lib/python3.12/site-packages/datasets/builder.py", line 1005, in download_and_prepare
    self._download_and_prepare(
  File "/opt/homebrew/lib/python3.12/site-packages/datasets/builder.py", line 1118, in _download_and_prepare
    verify_splits(self.info.splits, split_dict)
  File "/opt/homebrew/lib/python3.12/site-packages/datasets/utils/info_utils.py", line 101, in verify_splits
    raise NonMatchingSplitsSizesError(str(bad_splits))
datasets.utils.info_utils.NonMatchingSplitsSizesError: [{'expected': SplitInfo(name='train', num_bytes=18595506212.0, num_examples=113287, shard_lengths=None, dataset_name=None), 'recorded': SplitInfo(name='train', num_bytes=481592719, num_examples=2982, shard_lengths=None, dataset_name='coco_captions_train')}]
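
One thing that may be worth trying (a sketch, not verified against this dataset): the traceback shows `verify_splits` comparing the single shard that was actually loaded (2982 examples) with the full split size recorded in the dataset metadata (113287 examples). Asking `load_dataset` to skip that check with `verification_mode="no_checks"` might let the partial download through. The helper names `single_shard_kwargs` and `load_single_shard` are hypothetical, and `verification_mode` is assumed to be supported by the installed `datasets` version.

```python
def single_shard_kwargs(shard_path, cache_dir=None):
    """Build keyword arguments for datasets.load_dataset that request one shard.

    Hypothetical helper: it only assembles arguments, it does not download.
    """
    kwargs = {
        # Restrict the "train" split to a single parquet file in the repo.
        "data_files": {"train": shard_path},
        # Assumption: skipping verification avoids NonMatchingSplitsSizesError,
        # which is raised because one shard is smaller than the recorded split.
        "verification_mode": "no_checks",
    }
    if cache_dir is not None:
        kwargs["cache_dir"] = cache_dir
    return kwargs


def load_single_shard(repo_id, shard_path, cache_dir=None):
    """Load one parquet shard of a Hub dataset (needs network access)."""
    # Imported lazily so single_shard_kwargs can be used without `datasets`.
    import datasets

    return datasets.load_dataset(
        repo_id, **single_shard_kwargs(shard_path, cache_dir)
    )
```

With this, the call from the script above would become `load_single_shard("Multimodal-Fatima/COCO_captions_train", "data/train-00000-of-00038-757e7d149500e41c.parquet", cache_dir=config.IMAGE_DOWNLOAD_PATH)`. If the check is the only obstacle, the resulting `train` split should hold just the examples from that one file rather than the full 113287.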