Does datasets.load_dataset not support a seed?

Does datasets.load_dataset not support a seed to fix the random split of train/test/val? Would that not make sense as a feature? Working on Kaggle/Colab, I’m reloading the dataset over and over, and each time I get a different split, which makes comparisons across runs a little tricky …

I would expect that setting all seeds (e.g. numpy, random, torch) before you load your dataset should do the trick, but you have to restart your kernel between runs.
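A minimal sketch of what "setting all seeds" could look like (the helper name `set_all_seeds` is made up; the torch line is commented out since PyTorch may not be installed):

```python
import random

import numpy as np


def set_all_seeds(seed: int) -> None:
    """Seed the global RNGs so any code that draws from them
    (e.g. a random train/test split) is reproducible."""
    random.seed(seed)
    np.random.seed(seed)
    # If you also use PyTorch:
    # import torch
    # torch.manual_seed(seed)


set_all_seeds(42)
a = random.random()
set_all_seeds(42)
b = random.random()
assert a == b  # same seed, same draw
```

Note this only helps if the randomness actually comes from these RNGs; as discussed below, dataset loading itself is deterministic unless the loading script explicitly uses randomness.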


Which dataset is it? Dataset loading is deterministic unless the dataset loading script uses some randomness explicitly (or if it uses non-deterministic Python functions).


It’s the pubhealth dataset: health_fact · Datasets at Hugging Face

I don’t see anything random in this script a priori. What differences do you observe between runs?

Ah, I was seeing slightly different F1/accuracy metrics. I guess the answer from @deathcrush might be it then: it’s randomness in other parts of the system rather than in the dataset’s train split.

But it seems confusing if different datasets can take different approaches to splitting their data. Or can we generally assume the splits will be the same? It makes sense for splits to be fixed so people can accurately compare performance, but then again there are different splits depending on stratification etc. …

Randomly splitting a dataset is done by users, not in dataset scripts.
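For such a user-side split, the `datasets` library offers `dataset.train_test_split(test_size=..., seed=...)`, which is deterministic for a fixed seed. The idea can be sketched with the stdlib alone (the sizes and helper name here are made up for illustration):

```python
import random


def seeded_split(n_examples: int, test_fraction: float, seed: int):
    """Deterministically split example indices into train/test.
    A local random.Random(seed) is used, so the result does not
    depend on (or disturb) the global RNG state."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    n_test = int(n_examples * test_fraction)
    return indices[n_test:], indices[:n_test]


train_a, test_a = seeded_split(100, 0.2, seed=42)
train_b, test_b = seeded_split(100, 0.2, seed=42)
assert train_a == train_b and test_a == test_b  # same seed, same split
```

The same property holds for the library call: rerunning `train_test_split` with the same `seed` on the same dataset reproduces the split.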

That sounds good. So we can rely on Hugging Face datasets always supplying the same train/test/validation splits?

Yep exactly
