I created a custom script that splits the raw file into train and test splits on the fly. The script works with the default arguments. However, when I change the test_size ratio, which I pass via load_dataset(), it fails with the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/home/.local/share/virtualenvs/1717-yQ3Y_lVD/lib/python3.8/site-packages/datasets/load.py", line 1757, in load_dataset
builder_instance.download_and_prepare(
File "/Users/home/.local/share/virtualenvs/1717-yQ3Y_lVD/lib/python3.8/site-packages/datasets/builder.py", line 860, in download_and_prepare
self._download_and_prepare(
File "/Users/home/.local/share/virtualenvs/1717-yQ3Y_lVD/lib/python3.8/site-packages/datasets/builder.py", line 1611, in _download_and_prepare
super()._download_and_prepare(
File "/Users/home/.local/share/virtualenvs/1717-yQ3Y_lVD/lib/python3.8/site-packages/datasets/builder.py", line 971, in _download_and_prepare
verify_splits(self.info.splits, split_dict)
File "/Users/home/.local/share/virtualenvs/1717-yQ3Y_lVD/lib/python3.8/site-packages/datasets/utils/info_utils.py", line 74, in verify_splits
raise NonMatchingSplitsSizesError(str(bad_splits))
datasets.utils.info_utils.NonMatchingSplitsSizesError
It fails the integrity check as expected. The Build and load guide doesn't show how to update the checks. I thought using the download_mode=force_redownload argument in load_dataset() would fix it, but it throws the same error as shown above. How do I resolve this?
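For reference, the failing call looks roughly like this (the script path and the test_size parameter name are placeholders standing in for the actual custom script and its split-ratio argument):

from datasets import load_dataset

# Sketch of the failing call; the custom script forwards test_size to its builder config
dataset = load_dataset("my_dataset.py", test_size=0.3)
# raises datasets.utils.info_utils.NonMatchingSplitsSizesError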
Hi @polinaeterna
Yes, test_size is a parameter. Sure, with the ignore_verifications=True parameter it works (see the sketch below). But I would like to know how, for other datasets, this information gets updated when the data changes at the source; the instructions in the document I link to in the thread above don't explain this clearly.
I am doing a group shuffle split because I have to ensure there is no overlap in the id column between the respective splits.
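A minimal sketch of that workaround, assuming the same placeholder script name and split-ratio parameter as above:

from datasets import load_dataset

# Skips the split-size verification so the regenerated splits are accepted
dataset = load_dataset("my_dataset.py", test_size=0.3, ignore_verifications=True)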
@sl02
When you load your dataset locally for the first time, it creates a dataset_info.json file under its cache folder; this file contains the splits info (num_examples, num_bytes, etc.). If you regenerate the dataset while the script is unchanged (for example, by running load_dataset with download_mode="reuse_cache_if_exists"), it performs verifications against this file.
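For example, a regeneration that triggers these verifications might look like this (the dataset path is a placeholder):

from datasets import load_dataset

# Rebuilds the dataset from the unchanged script and checks the resulting split
# sizes against the ones recorded in the cached dataset_info.json
dataset = load_dataset("my_dataset.py", download_mode="reuse_cache_if_exists")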
We used to have dataset_info.json files in dataset repositories on the Hub (so, not just in a local cache folder) to verify splits info on the first download, but that is now deprecated; we use README.md instead for storing these numbers.
To (re)compute these numbers automatically and dump them to a README.md file, one should run datasets-cli test your_dataset --save_info. And as it's done manually, it depends on the dataset authors whether they update and push this info, as it's not required.
Hope it's more or less clear, feel free to ask any questions if it's not.
Note that you could also get this error when you try to download an updated dataset without using the cache. E.g.,
dataset = load_dataset(url, download_mode="force_redownload")
If the underlying dataset has been updated, there can be a mismatch between the number of records read and what is recorded in the cache. You can read about the cache here: Cache management.
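If you want to check whether a stale cache is involved, one way (a sketch with a placeholder dataset path, not specific to this thread's dataset) is to inspect the cached Arrow files and row counts behind each split:

from datasets import load_dataset

dataset = load_dataset("my_dataset.py")
# cache_files lists the Arrow files backing the split; num_rows is the count
# that gets compared against the recorded split sizes during verification
print(dataset["train"].cache_files)
print(dataset["train"].num_rows)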