Download only 1 of many parquet files

sachin · August 10, 2024, 11:48am

Hi all,

Just wondering if there is a way to download just 1 (or 2) of the parquet files that are available.

This answer suggests using streaming, but I am wondering whether there is a different way to do this too. For example, one answer that has potential but isn’t working for me is:

import datasets
import config

if __name__ == "__main__":
    hyper_parameters = config.DataConfig()

    dataset = datasets.load_dataset(
        "Multimodal-Fatima/COCO_captions_train",
        cache_dir=config.IMAGE_DOWNLOAD_PATH,
        data_files={"train": "data/train-00000-of-00038-757e7d149500e41c.parquet"},
    )
    print(len(dataset["train"]))

which gives the error

Generating train split:   3%|██▊                                                                                                          | 2982/113287 [00:00<00:14, 7393.63 examples/s]
Traceback (most recent call last):
  File "/Users/sachinthakaabeywardana/personal_work/tiny_captions/src/download.py", line 7, in <module>
    dataset = datasets.load_dataset(
              ^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.12/site-packages/datasets/load.py", line 2582, in load_dataset
    builder_instance.download_and_prepare(
  File "/opt/homebrew/lib/python3.12/site-packages/datasets/builder.py", line 1005, in download_and_prepare
    self._download_and_prepare(
  File "/opt/homebrew/lib/python3.12/site-packages/datasets/builder.py", line 1118, in _download_and_prepare
    verify_splits(self.info.splits, split_dict)
  File "/opt/homebrew/lib/python3.12/site-packages/datasets/utils/info_utils.py", line 101, in verify_splits
    raise NonMatchingSplitsSizesError(str(bad_splits))
datasets.utils.info_utils.NonMatchingSplitsSizesError: [{'expected': SplitInfo(name='train', num_bytes=18595506212.0, num_examples=113287, shard_lengths=None, dataset_name=None), 'recorded': SplitInfo(name='train', num_bytes=481592719, num_examples=2982, shard_lengths=None, dataset_name='coco_captions_train')}]

lhoestq · August 19, 2024, 1:28pm

Using the latest version of datasets should fix the issue.
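
For context, the traceback above fails inside `verify_splits`: the split generated from one shard (2982 examples) is compared against the size recorded for the full split (113287 examples). A minimal sketch of one possible workaround, assuming the installed `datasets` release accepts the `verification_mode` argument of `load_dataset`; the helper names `single_shard_kwargs` and `load_single_shard` are hypothetical and only for illustration:

```python
def single_shard_kwargs(shard_path, split="train"):
    """Build load_dataset keyword arguments for a single parquet shard."""
    return {
        "data_files": {split: shard_path},
        # Assumption: skipping verification avoids the split-size comparison
        # that raised NonMatchingSplitsSizesError in the traceback.
        "verification_mode": "no_checks",
    }


def load_single_shard(repo_id, shard_path, split="train"):
    """Load one shard of a Hub dataset without split-size verification."""
    # Imported lazily so the helper above can be inspected without the library.
    import datasets

    return datasets.load_dataset(repo_id, **single_shard_kwargs(shard_path, split))
```

Calling `load_single_shard("Multimodal-Fatima/COCO_captions_train", "data/train-00000-of-00038-757e7d149500e41c.parquet")` would then be expected to load only that file. Note that disabling verification also disables the other integrity checks the library would normally run.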

zera09 · March 19, 2025, 8:30am

The same issue persists even when datasets is updated. I got this error:

NonMatchingSplitsSizesError: [{'expected': SplitInfo(name='train', num_bytes=84881274635, num_examples=199036, shard_lengths=None, dataset_name=None), 'recorded': SplitInfo(name='train', num_bytes=1957796096, num_examples=4629, shard_lengths=[1200, 1300, 1200, 929], dataset_name='vision_arena-chat')}]

Does it have to do with how the dataset was stored/formatted when it was uploaded to HF?
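
On the question above: the `expected` values in the error appear to be the split sizes recorded for the whole dataset in the repository's metadata, while `recorded` describes only the files that were actually loaded, so a partial load will not match them. One way to sidestep the dataset builder entirely is to fetch individual files from the Hub. The following is a rough sketch, assuming the `huggingface_hub` package is installed and that shards follow the `<split>-XXXXX-of-YYYYY-....parquet` naming seen in this thread; `pick_shards` and `download_shards` are hypothetical helper names:

```python
def pick_shards(repo_files, split="train", count=1):
    """Return the first `count` parquet shards of a split, in name order."""
    shards = sorted(
        name
        for name in repo_files
        # Match on the file name only, ignoring any folder prefix.
        if name.endswith(".parquet")
        and name.rsplit("/", 1)[-1].startswith(f"{split}-")
    )
    return shards[:count]


def download_shards(repo_id, split="train", count=1):
    """Download the selected shards and return their local cache paths."""
    # Imported lazily; needs huggingface_hub and network access.
    from huggingface_hub import hf_hub_download, list_repo_files

    files = list_repo_files(repo_id, repo_type="dataset")
    return [
        hf_hub_download(repo_id=repo_id, filename=name, repo_type="dataset")
        for name in pick_shards(files, split, count)
    ]
```

The returned paths are ordinary local parquet files, so they can be opened with any parquet reader, for example `datasets.Dataset.from_parquet` or `pandas.read_parquet`. Because the builder is not involved, no split-size verification takes place.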
