NonMatchingSplitsSizesError

I created a custom script that splits the raw file into train/test splits on the fly. The script works with the default arguments. However, when I change the test_size ratio that I pass via load_dataset(), it fails with the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/home/.local/share/virtualenvs/1717-yQ3Y_lVD/lib/python3.8/site-packages/datasets/load.py", line 1757, in load_dataset
    builder_instance.download_and_prepare(
  File "/Users/home/.local/share/virtualenvs/1717-yQ3Y_lVD/lib/python3.8/site-packages/datasets/builder.py", line 860, in download_and_prepare
    self._download_and_prepare(
  File "/Users/home/.local/share/virtualenvs/1717-yQ3Y_lVD/lib/python3.8/site-packages/datasets/builder.py", line 1611, in _download_and_prepare
    super()._download_and_prepare(
  File "/Users/home/.local/share/virtualenvs/1717-yQ3Y_lVD/lib/python3.8/site-packages/datasets/builder.py", line 971, in _download_and_prepare
    verify_splits(self.info.splits, split_dict)
  File "/Users/home/.local/share/virtualenvs/1717-yQ3Y_lVD/lib/python3.8/site-packages/datasets/utils/info_utils.py", line 74, in verify_splits
    raise NonMatchingSplitsSizesError(str(bad_splits))
datasets.utils.info_utils.NonMatchingSplitsSizesError

It fails the integrity check as expected, but the Build and load docs don’t show how to update the checks. I thought using the download_mode="force_redownload" argument in load_dataset() would fix it, but it throws the same error shown above. How do I resolve this?

1 Like

Hi @sl02! Is test_size a custom builder parameter you define in your loading script?

You can set the ignore_verifications=True param in load_dataset() to skip the split sizes verification.
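For example, assuming test_size is the custom builder parameter defined in your loading script (the script path below is a placeholder):

from datasets import load_dataset

# test_size is the custom builder parameter from the loading script;
# ignore_verifications=True skips the recorded split size verification.
ds = load_dataset("path/to/your_script.py", test_size=0.3, ignore_verifications=True)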

Also note that the Dataset object has a .train_test_split() method, which might be useful for your case.
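A minimal sketch of that approach, loading everything as a single split and splitting it afterwards (the script path is again a placeholder):

from datasets import load_dataset

# Load the whole dataset as one split, then split it in memory.
ds = load_dataset("path/to/your_script.py", split="train")
splits = ds.train_test_split(test_size=0.3, seed=42)
train_ds, test_ds = splits["train"], splits["test"]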

1 Like

Hi @polinaeterna
Yes, test_size is a parameter. Sure, it works with the ignore_verifications=True parameter. But I would like to know how you update this information for other datasets when the data changes at the source; the instructions in the document I link to above don’t explain this clearly.

I am doing a group shuffle split because I have to ensure there is no overlap in the id column between the respective splits.
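Dataset.train_test_split() has no group-aware option, so one way to sketch such a split is with scikit-learn’s GroupShuffleSplit over the id column (ds stands for an already loaded Dataset; everything else here is illustrative):

from sklearn.model_selection import GroupShuffleSplit

# Split row indices while grouping by the "id" column, so that no id
# ends up in both splits.
gss = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
train_idx, test_idx = next(gss.split(range(len(ds)), groups=ds["id"]))

train_ds = ds.select(train_idx)
test_ds = ds.select(test_idx)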

@sl02
When you load your dataset locally for the first time, a dataset_info.json file is created under its cache folder; this file contains all the splits info (num_examples, num_bytes, etc.). If you regenerate the dataset while the script is unchanged (for example, by running load_dataset with download_mode="reuse_cache_if_exists"), the verifications are performed against this file.
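As an illustration (the script path is a placeholder), regenerating against the cache and inspecting the recorded split metadata might look like this:

from datasets import load_dataset

# Reuse the cached raw files but rebuild the dataset; the split sizes are
# verified against the dataset_info.json written on the first run.
ds = load_dataset("path/to/your_script.py", download_mode="reuse_cache_if_exists")
print(ds["train"].info.splits)  # recorded num_examples / num_bytes per split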

We used to have dataset_info.json files in dataset repositories on the Hub (so, not just in a local cache folder) to verify the splits info on the first download, but that is now deprecated; we use README.md instead for storing these numbers.
To (re)compute these numbers automatically and dump them to a README.md file, one should run datasets-cli test your_dataset --save_info. And since this is done manually, it depends on the datasets’ authors whether they update and push this info, as it is not required.
Hope it’s more or less clear, feel free to ask any questions if it’s not :slight_smile:

3 Likes

@polinaeterna
Thanks for clearing that up!

Note that you can also get this error when you try to download an updated dataset without using the cache, e.g.:
dataset = load_dataset(url, download_mode="force_redownload")

If the underlying dataset has been updated, there can be a mismatch between the number of records that are read and the split sizes recorded in the cache. You can read about the cache in the Cache management docs.
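In that case, one option (a sketch; the dataset name is a placeholder) is to catch the error and regenerate with the verification skipped:

from datasets import load_dataset
from datasets.utils.info_utils import NonMatchingSplitsSizesError

try:
    dataset = load_dataset("user/dataset", download_mode="force_redownload")
except NonMatchingSplitsSizesError:
    # The recorded split sizes are stale; skip the check and regenerate.
    dataset = load_dataset(
        "user/dataset",
        download_mode="force_redownload",
        ignore_verifications=True,
    )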