My dataset JSON file had some mistakes in it, so I fixed them and re-uploaded it. I also ran

```
datasets-cli test datasets/<your-dataset-folder> --save_infos --all_configs
```

with my dataset folder to re-generate the dataset_infos.json file with the metadata.
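For reference, a quick way to sanity-check the regenerated metadata against the local file would be something like the sketch below. It assumes the usual dataset_infos.json layout (each config carrying a `download_checksums` map of source file to `num_bytes`/`checksum`, with the checksum being a sha256 hex digest) and it uses my local paths; both of those are assumptions, so adjust as needed.

```python
# Rough sketch: hash the cleaned data file and compare it to the checksums
# recorded in the regenerated dataset_infos.json. The "download_checksums"
# key and its shape are assumptions based on the usual metadata layout.
import hashlib
import json

with open("datasets/stackoverflow_2022/dataset_infos.json") as f:
    infos = json.load(f)

sha256 = hashlib.sha256()
with open("stackoverflow_2022_data_set_cleaned.json", "rb") as f:
    # Read in 1 MiB chunks so large files don't have to fit in memory.
    for chunk in iter(lambda: f.read(1 << 20), b""):
        sha256.update(chunk)

print("local file:", sha256.hexdigest())
for config_name, info in infos.items():
    for url, meta in info.get("download_checksums", {}).items():
        print(config_name, url, "->", meta.get("checksum"))
```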
After uploading these two revised files, I am testing a fresh download of the dataset in Python to make sure everything works, and it is failing with a checksum error on the metadata. Specifically, I am running this (the dataset is private until we get it all figured out, so it also takes a `use_auth_token` argument):

```python
ds = datasets.load_dataset('microsoft/stackoverflow_2022', split='test', use_auth_token='<...>', download_mode='force_redownload')
```
While redownloading the dataset, it fails with this error:

```
File "<...>/info_utils.py", line 40, in verify_checksums
    raise NonMatchingChecksumError(error_msg + str(bad_urls))
datasets.utils.info_utils.NonMatchingChecksumError: Checksums didn't match for dataset source files:
['stackoverflow_2022_data_set_cleaned.json']
```
In fact, even if I go to the repository and delete dataset_infos.json entirely, I still get this error. That suggests to me that dataset_infos.json is being cached somewhere and needs to be forcibly refreshed, much like `download_mode='force_redownload'` does for the data files. That argument works for the dataset JSON file itself, but it does not seem to touch dataset_infos.json. Is there something else I need to run, or is there actually a problem with that file?
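In case it helps to see what I mean, this is the kind of blunt manual cache wipe I'd expect to need as a workaround. It's only a sketch, assuming the default cache locations that `datasets.config` exposes (`HF_DATASETS_CACHE` and `HF_MODULES_CACHE`, i.e. `~/.cache/huggingface/datasets` and `~/.cache/huggingface/modules`); I'd much rather understand why `force_redownload` alone isn't enough.

```python
# Rough sketch of a full manual cache wipe; the two paths below are the
# defaults exposed by datasets.config (assumption: they resolve to
# ~/.cache/huggingface/datasets and ~/.cache/huggingface/modules).
import shutil

import datasets.config

# Data cache: downloaded/extracted source files and prepared Arrow datasets.
shutil.rmtree(datasets.config.HF_DATASETS_CACHE, ignore_errors=True)
# Modules cache: cached dataset loading scripts and their dataset_infos.json.
shutil.rmtree(datasets.config.HF_MODULES_CACHE, ignore_errors=True)
```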
Thanks!