I created a custom loading script that splits the raw file into train/test splits on the fly. The script works with the default arguments. However, when I change the `test_size` ratio, which I pass via `load_dataset()`, it fails with the error below.
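The call looks roughly like this (a minimal sketch; the script path and the ratio value are placeholders, and `test_size` is my script's own config parameter, forwarded to the builder by `load_dataset()`):

```python
from datasets import load_dataset

# "my_dataset.py" stands in for my actual loading script; extra
# keyword arguments such as test_size are forwarded to the builder
# config, where the script uses them to split the raw file.
dataset = load_dataset("my_dataset.py", test_size=0.3)
```

This is the resulting traceback: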
```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/home/.local/share/virtualenvs/1717-yQ3Y_lVD/lib/python3.8/site-packages/datasets/load.py", line 1757, in load_dataset
    builder_instance.download_and_prepare(
  File "/Users/home/.local/share/virtualenvs/1717-yQ3Y_lVD/lib/python3.8/site-packages/datasets/builder.py", line 860, in download_and_prepare
    self._download_and_prepare(
  File "/Users/home/.local/share/virtualenvs/1717-yQ3Y_lVD/lib/python3.8/site-packages/datasets/builder.py", line 1611, in _download_and_prepare
    super()._download_and_prepare(
  File "/Users/home/.local/share/virtualenvs/1717-yQ3Y_lVD/lib/python3.8/site-packages/datasets/builder.py", line 971, in _download_and_prepare
    verify_splits(self.info.splits, split_dict)
  File "/Users/home/.local/share/virtualenvs/1717-yQ3Y_lVD/lib/python3.8/site-packages/datasets/utils/info_utils.py", line 74, in verify_splits
    raise NonMatchingSplitsSizesError(str(bad_splits))
datasets.utils.info_utils.NonMatchingSplitsSizesError
```
It fails the integrity check, as expected. The "Build and load" docs page doesn't show how to update the checks. I thought the `download_mode="force_redownload"` argument to `load_dataset()` would fix it, but it throws the same error as shown above.
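This is roughly what the retry looked like (same placeholder script path and ratio as above; `download_mode` is a real `load_dataset()` parameter):

```python
from datasets import load_dataset

# Force a fresh download and preparation; this still raises
# NonMatchingSplitsSizesError with a non-default test_size.
dataset = load_dataset(
    "my_dataset.py",                   # placeholder script path
    test_size=0.3,                     # placeholder ratio
    download_mode="force_redownload",
)
```

How do I resolve this?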