I created a custom script that splits the raw file into train/test splits on the fly. The script works with the default arguments. However, when I change the `test_size` ratio that I pass via `load_dataset()`, it fails with the following error:
```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/home/.local/share/virtualenvs/1717-yQ3Y_lVD/lib/python3.8/site-packages/datasets/load.py", line 1757, in load_dataset
  File "/Users/home/.local/share/virtualenvs/1717-yQ3Y_lVD/lib/python3.8/site-packages/datasets/builder.py", line 860, in download_and_prepare
  File "/Users/home/.local/share/virtualenvs/1717-yQ3Y_lVD/lib/python3.8/site-packages/datasets/builder.py", line 1611, in _download_and_prepare
  File "/Users/home/.local/share/virtualenvs/1717-yQ3Y_lVD/lib/python3.8/site-packages/datasets/builder.py", line 971, in _download_and_prepare
  File "/Users/home/.local/share/virtualenvs/1717-yQ3Y_lVD/lib/python3.8/site-packages/datasets/utils/info_utils.py", line 74, in verify_splits
```
It fails the integrity check, as expected. The Build and load documentation doesn't show how to update the checks. I thought using the `download_mode="force_redownload"` argument in `load_dataset()` would fix it, but it throws the same error shown above. How do I resolve this?
Hi @sl02! Is `test_size` a custom builder parameter you define in your loading script?
You can set the `ignore_verifications=True` param in `load_dataset` to skip the split-size verification.
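To see why the flag helps, here is a simplified sketch of the check that fails (this is an illustration, not the actual `datasets` internals — the function and error names below are stand-ins modeled on `verify_splits`):

```python
# Simplified illustration of split-size verification: the sizes recorded in
# the cached dataset_info.json are compared against the freshly regenerated
# ones, and any mismatch raises unless verification is skipped.

class NonMatchingSplitsSizesError(ValueError):
    """Raised when regenerated split sizes differ from the recorded ones."""

def verify_splits(regenerated, recorded, ignore_verifications=False):
    if ignore_verifications:
        return  # what ignore_verifications=True effectively does
    mismatches = {
        name: (size, recorded.get(name))
        for name, size in regenerated.items()
        if size != recorded.get(name)
    }
    if mismatches:
        raise NonMatchingSplitsSizesError(f"split sizes do not match: {mismatches}")

recorded = {"train": 800, "test": 200}     # sizes cached on the first run
regenerated = {"train": 700, "test": 300}  # sizes after changing test_size

verify_splits(regenerated, recorded, ignore_verifications=True)  # passes
```

Changing `test_size` changes the regenerated sizes while the recorded ones stay fixed, which is why `force_redownload` alone does not help: the same recorded numbers are checked again.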
Also note that the `Dataset` object has a `.train_test_split()` method, which might be useful for your case.
`test_size` is a parameter. Sure, with the `ignore_verifications=True` parameter it works. But I would like to know how this information gets updated for other datasets when the data changes at the source; the instructions in the document I link to in the thread above don't explain this clearly.

I am doing a group shuffle split because I have to ensure there is no overlap in the `id` column between the respective splits.
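For reference, a group shuffle split can be sketched in plain Python (the function and variable names here are illustrative, not from the loading script in question): all rows sharing an `id` are assigned to the same split, so the splits cannot overlap on that column.

```python
import random
from collections import defaultdict

def group_shuffle_split(rows, group_key="id", test_size=0.2, seed=0):
    """Split rows so that all rows sharing `group_key` land in one split."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)

    group_ids = sorted(groups)
    random.Random(seed).shuffle(group_ids)  # deterministic shuffle of groups

    n_test = max(1, round(len(group_ids) * test_size))
    test_groups = group_ids[:n_test]
    train_groups = group_ids[n_test:]

    train = [row for gid in train_groups for row in groups[gid]]
    test = [row for gid in test_groups for row in groups[gid]]
    return train, test

# Two rows per id, five ids in total.
rows = [{"id": i // 2, "text": f"row {i}"} for i in range(10)]
train, test = group_shuffle_split(rows, test_size=0.4)

train_ids = {row["id"] for row in train}
test_ids = {row["id"] for row in test}
assert train_ids.isdisjoint(test_ids)  # no id appears in both splits
```

Since groups (not rows) are shuffled and partitioned, the realized row-level ratio only approximates `test_size` when group sizes are uneven.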
When you load your dataset locally for the first time, a `dataset_info.json` file is created under its cache folder; the file contains all the split info (`num_bytes`, etc.). If you regenerate the dataset while the script is unchanged (for example, with `download_mode="reuse_cache_if_exists"`), verification is performed against this file.
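The shape of that cached file can be illustrated with a minimal stand-in (the real file written by `datasets` records more fields; the paths and values below are made up):

```python
import json
import os
import tempfile

# Minimal stand-in for the split info recorded on the first generation.
info = {
    "splits": {
        "train": {"num_bytes": 1024, "num_examples": 800},
        "test": {"num_bytes": 256, "num_examples": 200},
    }
}

# Pretend this temp dir is the dataset's cache folder.
cache_dir = tempfile.mkdtemp()
path = os.path.join(cache_dir, "dataset_info.json")
with open(path, "w") as f:
    json.dump(info, f)

# On a later regeneration, the recorded sizes are read back and compared
# against the freshly computed ones; a changed test_size breaks the match.
with open(path) as f:
    recorded = json.load(f)
```

This is why deleting (or regenerating) the cache alone does not help when the recorded numbers themselves are stale: the comparison source must be updated, not the data.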
We used to have `dataset_info.json` files in dataset repositories on the Hub (so, not just in the local cache folder) to verify split info on the first download, but this is now deprecated; we use `README.md` instead for storing these numbers.
To (re)compute these numbers automatically and dump them to a `README.md` file, run `datasets-cli test your_dataset --save_info`. As this is done manually, it depends on the dataset's authors whether they update and push this info, since it is not required.
Hope it’s more or less clear; feel free to ask any questions if it’s not!
Thanks for clearing that up!