Does datasets support nested datasets? I have a nested dataset path dict, e.g.,
{
    "a": {
        "train": ...,
        "dev": ...,
        "test": ...,
    },
    "b": {
        "train": ...,
        "dev": ...,
        "test": ...,
    },
}
and load_dataset seems unhappy with this structure (TypeError: expected str, bytes or os.PathLike object, not dict). Does that imply I have to call load_dataset for each sub-dataset and then merge them together?
Further, is there any support for over/undersampling given multiple (sub-)datasets? Thanks a lot!
Hi!
Indeed, load_dataset is used to load only one dataset. This dataset can have several splits though. In your case you have to call load_dataset once per dataset.
Regarding over/undersampling, we are adding an interleave_datasets function that creates a new dataset from several datasets. By default it alternates between the original datasets, but you can specify sampling probabilities to over/undersample from them. It will be available in the next release of the library.
What you can do in the meantime is use a mix of shuffle and concatenate_datasets.
If you have dataset1 and dataset2 and want to get dataset3 with 10 examples from dataset1 and 90 examples from dataset2, you can do:
from datasets import concatenate_datasets

seed = 42
# Shuffle each dataset, take the first 10 / 90 examples, then concatenate.
dataset3 = concatenate_datasets([
    dataset1.shuffle(seed=seed).select(range(10)),
    dataset2.shuffle(seed=seed).select(range(90)),
])
Thanks for the reply! Regarding shuffling, my understanding is that by default, when using the Trainer, the dataset will be shuffled and then batched. So if I don’t need sampling, I could just concatenate all the datasets and pass the result to the Trainer. Is that right?
I don’t think the Trainer shuffles the data, so you should probably shuffle the dataset after concatenation.
EDIT: actually it does, see thom’s comment.
But the official examples don’t seem to mention shuffling explicitly; if the Trainer didn’t shuffle by default, wouldn’t those examples have issues (e.g., there is no “shuffl*” in this example)?