Dataset_infos.json getting cached?

My dataset JSON file had some mistakes in it, so I fixed them and re-uploaded it. I also ran datasets-cli test datasets/<your-dataset-folder> --save_infos --all_configs on my dataset folder to regenerate the dataset_infos.json file with the metadata.

After uploading these two revised files, I am re-downloading the dataset in Python to make sure everything is working, and it is failing due to a checksum issue in the metadata. In particular, I am running this command (the dataset is private until we get it all figured out, so it has a use_auth_token argument too):

ds = datasets.load_dataset('microsoft/stackoverflow_2022', split='test', use_auth_token='<...>', download_mode='force_redownload')

and while trying to redownload the dataset, it fails with this error:

  File "<...>/info_utils.py", line 40, in verify_checksums
    raise NonMatchingChecksumError(error_msg + str(bad_urls))
datasets.utils.info_utils.NonMatchingChecksumError: Checksums didn't match for dataset source files:
['stackoverflow_2022_data_set_cleaned.json']

In fact, if I go to the repository and completely delete dataset_infos.json from it, I still get this error, which to me implies that the dataset_infos.json file is being cached somewhere and that I need to forcibly redownload it with something like download_mode='force_redownload'. That argument works for the dataset JSON file itself, but it is not working for dataset_infos.json. Is there something else I need to run, or is there actually a problem with that file?

Thanks!

Hi! Yes, this file should be cached under ~/.cache/huggingface/modules/datasets_modules/datasets/microsoft---stackoverflow_2022/<module_hash>. In my tests, setting download_mode="force_redownload" in load_dataset also updates this infos file, so perhaps you can update your installation of datasets (pip install -U datasets) and try again?
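
If upgrading doesn't help, here is a rough sketch of clearing the cached module by hand and then forcing a fresh download. The cache path follows the layout mentioned above, but the exact directory names can differ between datasets versions, so treat the glob pattern as an assumption:

import shutil
from pathlib import Path

import datasets

# Cached loading script + dataset_infos.json live under the modules cache
# (layout assumed from the path above; adjust if your version differs).
modules_cache = Path.home() / ".cache" / "huggingface" / "modules" / "datasets_modules" / "datasets"
for cached in modules_cache.glob("microsoft---stackoverflow_2022*"):
    shutil.rmtree(cached)

# Then reload with a forced redownload of the data files themselves.
ds = datasets.load_dataset(
    "microsoft/stackoverflow_2022",
    split="test",
    use_auth_token="<...>",
    download_mode="force_redownload",
)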

It turns out there was something wrong with the actual checksums. I still haven’t figured out what, but removing the dataset_infos.json file from the repository and deleting the cached copy allowed us to load the dataset again. I don’t know if we’ll want that file back in the future – we may not care about checksum validation for this particular dataset. It isn’t a training dataset or anything like that, just a set of data to run tests on. For now we will just drop that file and load the dataset without checksum validation, because no matter how I regenerate it, loading comes back saying the checksums don’t match.
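
For reference, a minimal sketch of what loading without checksum validation looks like (the parameter name depends on the datasets version: older releases take ignore_verifications, newer ones take verification_mode):

import datasets

ds = datasets.load_dataset(
    "microsoft/stackoverflow_2022",
    split="test",
    use_auth_token="<...>",
    ignore_verifications=True,  # newer versions: verification_mode="no_checks"
)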

(For what it’s worth – it did work just fine the very first time I generated the file, but after making a small change to the dataset itself, I have not been able to get a working checksum again.)
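
One way to narrow down where the mismatch comes from is to compare the sha256 recorded in dataset_infos.json against the file that was actually uploaded. This is only a debugging sketch: the file name is taken from the error above, and the download_checksums layout is the one older datasets versions write, so adjust the keys if yours differs.

import hashlib
import json

def sha256_of(path):
    # Stream the file so large JSON dumps don't have to fit in memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

local_hash = sha256_of("stackoverflow_2022_data_set_cleaned.json")

with open("dataset_infos.json") as f:
    infos = json.load(f)

# dataset_infos.json is keyed by config name; each entry records a checksum
# per downloaded file (layout assumed from older datasets versions).
for config_name, info in infos.items():
    for url, meta in info.get("download_checksums", {}).items():
        if url.endswith("stackoverflow_2022_data_set_cleaned.json"):
            print(config_name, "recorded:", meta["checksum"], "local:", local_hash)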
