Does datasets.load_dataset not support a seed?

Does datasets.load_dataset not support a seed to fix the random split of train/test/val? Would that not make sense as a feature? Working on Kaggle/Colab, I’m reloading the dataset over and over, and each time I get a different split, which makes comparisons across runs a little tricky …

I would expect that setting all seeds (e.g. numpy, random, torch) before you load your dataset should do the trick, but you have to restart your kernel between runs.
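A minimal sketch of what "setting all seeds" could look like (the helper name `set_all_seeds` is made up; the torch line is commented out since PyTorch may not be installed):

```python
import random

import numpy as np


def set_all_seeds(seed: int) -> None:
    """Seed the global RNGs so any code that draws from them
    (e.g. a random train/test split) is reproducible."""
    random.seed(seed)
    np.random.seed(seed)
    # If you also use PyTorch:
    # import torch
    # torch.manual_seed(seed)


set_all_seeds(42)
a = random.random()
set_all_seeds(42)
b = random.random()
assert a == b  # same seed, same draw
```

Note this only helps if the randomness actually comes from these RNGs; as discussed below, dataset loading itself is deterministic unless the loading script explicitly uses randomness.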


Which dataset is it? Dataset loading is deterministic unless the dataset loading script uses some randomness explicitly (or if it uses non-deterministic Python functions).


It’s the pubhealth dataset: health_fact · Datasets at Hugging Face

I don’t see anything random in this script a priori. What differences do you observe between runs?

Ah, I was seeing slightly different F1/accuracy metrics. I guess the answer from @deathcrush might be it then: it’s randomness in other parts of the system rather than in the dataset’s train split.

But it seems confusing if different datasets can take different approaches to splitting their data. Or can we generally assume the splits will be the same? It makes sense for splits to be fixed so people can accurately compare performance, but then again there are different splits depending on stratification etc. …

Randomly splitting a dataset is done by users, not in dataset scripts.
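For such a user-side split, the `datasets` library offers `dataset.train_test_split(test_size=..., seed=...)`, which is deterministic for a fixed seed. The idea can be sketched with the stdlib alone (the sizes and helper name here are made up for illustration):

```python
import random


def seeded_split(n_examples: int, test_fraction: float, seed: int):
    """Deterministically split example indices into train/test.
    A local random.Random(seed) is used, so the result does not
    depend on (or disturb) the global RNG state."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    n_test = int(n_examples * test_fraction)
    return indices[n_test:], indices[:n_test]


train_a, test_a = seeded_split(100, 0.2, seed=42)
train_b, test_b = seeded_split(100, 0.2, seed=42)
assert train_a == train_b and test_a == test_b  # same seed, same split
```

The same property holds for the library call: rerunning `train_test_split` with the same `seed` on the same dataset reproduces the split.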

That sounds good. So we can rely on Hugging Face datasets always supplying the same train/test/validation splits?

Yep exactly
