I created a custom script that splits the raw file into train and test splits on the fly. The script works with the default arguments. However, when I change the test_size ratio, which I pass via load_dataset(), it fails with the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/home/.local/share/virtualenvs/1717-yQ3Y_lVD/lib/python3.8/site-packages/datasets/load.py", line 1757, in load_dataset
builder_instance.download_and_prepare(
File "/Users/home/.local/share/virtualenvs/1717-yQ3Y_lVD/lib/python3.8/site-packages/datasets/builder.py", line 860, in download_and_prepare
self._download_and_prepare(
File "/Users/home/.local/share/virtualenvs/1717-yQ3Y_lVD/lib/python3.8/site-packages/datasets/builder.py", line 1611, in _download_and_prepare
super()._download_and_prepare(
File "/Users/home/.local/share/virtualenvs/1717-yQ3Y_lVD/lib/python3.8/site-packages/datasets/builder.py", line 971, in _download_and_prepare
verify_splits(self.info.splits, split_dict)
File "/Users/home/.local/share/virtualenvs/1717-yQ3Y_lVD/lib/python3.8/site-packages/datasets/utils/info_utils.py", line 74, in verify_splits
raise NonMatchingSplitsSizesError(str(bad_splits))
datasets.utils.info_utils.NonMatchingSplitsSizesError
It fails the integrity check as expected. The Build and load guide doesn't show how to update the checks. I thought using the download_mode=force_redownload argument in load_dataset() would fix it, but it throws the same error as shown above. How do I resolve this?
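For reference, the failing call looks roughly like this (the script path and the test_size parameter name are placeholders standing in for the actual custom script and its split-ratio argument):

from datasets import load_dataset

# Sketch of the failing call; the custom script forwards test_size to its builder config
dataset = load_dataset("my_dataset.py", test_size=0.3)
# raises datasets.utils.info_utils.NonMatchingSplitsSizesError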
Hi @polinaeterna
Yes, test_size is a parameter. Sure, with the ignore_verifications=True parameter it works (see the sketch below). But I would like to know how, for other datasets, this information gets updated when the data changes at the source; the instructions in the document I link to in the thread above don't explain this clearly.
I am doing a group shuffle split because I have to ensure there is no overlap in the id column between the respective splits.
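A minimal sketch of that workaround, assuming the same placeholder script name and split-ratio parameter as above:

from datasets import load_dataset

# Skips the split-size verification so the regenerated splits are accepted
dataset = load_dataset("my_dataset.py", test_size=0.3, ignore_verifications=True)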
@sl02
When you load your dataset locally for the first time, it creates a dataset_info.json file under its cache folder; this file contains the splits info (num_examples, num_bytes, etc.). If you regenerate the dataset while the script is unchanged (for example, by running load_dataset with download_mode="reuse_cache_if_exists"), it performs verifications against this file.
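For example, a regeneration that triggers these verifications might look like this (the dataset path is a placeholder):

from datasets import load_dataset

# Rebuilds the dataset from the unchanged script and checks the resulting split
# sizes against the ones recorded in the cached dataset_info.json
dataset = load_dataset("my_dataset.py", download_mode="reuse_cache_if_exists")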
We used to have dataset_info.json files in dataset repositories on the Hub (so, not just in a local cache folder) to verify splits info on the first download, but that is now deprecated; we use README.md instead for storing these numbers.
To (re)compute these numbers automatically and dump them to a README.md file, one should run datasets-cli test your_dataset --save_info. And as it's done manually, it depends on the dataset authors whether they update and push this info, as it's not required.
Hope it's more or less clear, feel free to ask any questions if it's not.
Note that you could also get this error when you try to download an updated dataset without using the cache. E.g.,
dataset = load_dataset(url, download_mode="force_redownload")
If the underlying dataset has been updated, there can be a mismatch between the number of records read and what is recorded in the cache. You can read about the cache here: Cache management.
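If you want to check whether a stale cache is involved, one way (a sketch with a placeholder dataset path, not specific to this thread's dataset) is to inspect the cached Arrow files and row counts behind each split:

from datasets import load_dataset

dataset = load_dataset("my_dataset.py")
# cache_files lists the Arrow files backing the split; num_rows is the count
# that gets compared against the recorded split sizes during verification
print(dataset["train"].cache_files)
print(dataset["train"].num_rows)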