Long-term reproducibility for `load_dataset`?

I’m running into an issue where some fairly basic code, which definitely worked about 10 months ago, now produces an error. The Docker image where the code runs has not changed, so it must be caused by some invisible change on the Hugging Face side.

Here is the entire code snippet:

from datasets import load_dataset
dataset = load_dataset("sms_spam")

Previously, this retrieved the sms_spam dataset. Now, it produces a KeyError: 'tags'.

My package versions are:

datasets==2.18.0
huggingface-hub==0.21.4
transformers==4.36.0

I see another thread flagging this same KeyError, and the suggested solution is to upgrade datasets and huggingface_hub: Huggingface dataset install - #4 by John6666

I’ll follow this advice for now, but it isn’t a permanent fix.

I am trying to write Hugging Face code that will keep working for years, not stop working after a couple of months!

Can anyone explain:

  1. Is there any way to prevent this from happening again? For example, can I “pin” the dataset itself to a particular version, or pass some other argument (e.g. the URL that is called when the dataset is loaded)? I suppose the alternative is downloading and saving all datasets locally, but that’s less convenient, and convenience is why I’m using load_dataset in the first place.
  2. Why did this stop working? For example, is there a particular discussion or commit message where I can read about the decision to make old versions of these packages stop working? Even if it’s not preventable, knowing where these conversations happen would let me be proactive rather than reactive.
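On question 1, a minimal sketch of two pinning strategies. Assumptions: load_dataset accepts a revision argument (a branch, tag, or commit SHA on the Hub), and the cache directory name and "main" revision below are placeholders you would replace with your own path and a real commit SHA.

from pathlib import Path


def local_copy_or_none(cache_root: str, repo_id: str) -> Path | None:
    """Return the path of a previously saved local copy, or None if absent."""
    candidate = Path(cache_root) / repo_id.replace("/", "__")
    return candidate if candidate.exists() else None


if __name__ == "__main__":
    from datasets import load_dataset, load_from_disk

    CACHE_ROOT = "./dataset_snapshots"  # assumed local directory
    REPO_ID = "sms_spam"
    PINNED_REVISION = "main"  # placeholder: replace with a commit SHA to truly pin

    local = local_copy_or_none(CACHE_ROOT, REPO_ID)
    if local is not None:
        # Fully offline after the first run: no Hub request at all.
        dataset = load_from_disk(str(local))
    else:
        # Pin the dataset repo to a fixed revision so the data can't
        # change underneath you, then snapshot it locally.
        dataset = load_dataset(REPO_ID, revision=PINNED_REVISION)
        dataset.save_to_disk(str(Path(CACHE_ROOT) / REPO_ID.replace("/", "__")))

Pinning with revision still depends on the client libraries understanding the Hub API, which is what broke here; the save_to_disk/load_from_disk copy is the only option that removes the Hub from the loop entirely.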

Thanks!


If this discussion is happening anywhere, it is probably on GitHub, in the repository’s Issues or Discussions sections.

Hi! This old huggingface_hub version has an issue that makes it not robust to recent changes on the Hugging Face Hub; you can fix the issue by upgrading to the latest huggingface_hub version.
