Long-term reproducibility for `load_dataset`?

I’m running into an issue where some fairly basic code, which definitely worked about 10 months ago, now produces an error. The Docker image where the code runs has not changed, so it must be caused by some invisible change on the Hugging Face side.

Here is the entire code snippet:

from datasets import load_dataset
dataset = load_dataset("sms_spam")

Previously, this retrieved the sms_spam dataset. Now, it produces a KeyError: 'tags'.

My package versions are:

datasets==2.18.0
huggingface-hub==0.21.4
transformers==4.36.0

I see another thread flagging this same KeyError, and the suggested solution is to upgrade datasets and huggingface_hub: Huggingface dataset install - #4 by John6666

I’ll follow this advice for now, but it isn’t a permanent fix.

I am trying to write Hugging Face code that will keep working for years, not stop working after a couple of months!

Can anyone explain:

  1. Is there any way to prevent this from happening again? For example, can I “pin” the dataset itself to a particular version, or pass some other argument (e.g. the URL that is called when the dataset is loaded)? I suppose the alternative is downloading and saving all datasets locally, but that’s less convenient, and convenience is why I’m using load_dataset in the first place.
  2. Why did this stop working? For example, is there a particular discussion or commit message where I can read about the decision to make old versions of these packages stop working? Even if it’s not preventable, knowing where these conversations happen would let me be proactive rather than reactive.
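On question 1, a minimal sketch of two pinning strategies. Assumptions: load_dataset accepts a revision argument (a branch, tag, or commit SHA on the Hub), and the cache directory name and "main" revision below are placeholders you would replace with your own path and a real commit SHA.

from pathlib import Path


def local_copy_or_none(cache_root: str, repo_id: str) -> Path | None:
    """Return the path of a previously saved local copy, or None if absent."""
    candidate = Path(cache_root) / repo_id.replace("/", "__")
    return candidate if candidate.exists() else None


if __name__ == "__main__":
    from datasets import load_dataset, load_from_disk

    CACHE_ROOT = "./dataset_snapshots"  # assumed local directory
    REPO_ID = "sms_spam"
    PINNED_REVISION = "main"  # placeholder: replace with a commit SHA to truly pin

    local = local_copy_or_none(CACHE_ROOT, REPO_ID)
    if local is not None:
        # Fully offline after the first run: no Hub request at all.
        dataset = load_from_disk(str(local))
    else:
        # Pin the dataset repo to a fixed revision so the data can't
        # change underneath you, then snapshot it locally.
        dataset = load_dataset(REPO_ID, revision=PINNED_REVISION)
        dataset.save_to_disk(str(Path(CACHE_ROOT) / REPO_ID.replace("/", "__")))

Pinning with revision still depends on the client libraries understanding the Hub API, which is what broke here; the save_to_disk/load_from_disk copy is the only option that removes the Hub from the loop entirely.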

Thanks!


If this discussion is happening anywhere, it is probably on GitHub, in the repository’s Issues or Discussions sections.

Hi! This old huggingface_hub version has an issue that makes it not robust to recent changes on the Hugging Face Hub; you can fix the issue by upgrading to the latest huggingface_hub version.
