Nested datasets and oversampling

Does datasets support nested datasets? I have a nested dataset path dict, e.g.,

{
	a: {
		train:
		dev:
		test:
	},
	b: {
		train:
		dev:
		test:
	}
}

and load_dataset seems unhappy with this structure (TypeError: expected str, bytes or os.PathLike object, not dict). Does that imply I have to do load_dataset for each sub-dataset and then merge them together?

Further, is there any support for over/undersampling given multiple (sub-)datasets? Thanks a lot!

Hi!
Indeed, load_dataset loads only one dataset at a time. That dataset can have several splits, though.
In your case you have to call load_dataset once per dataset.
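
As a rough sketch of "one call per dataset", the helper below loops over the nested path dict and calls the loader once for each sub-dataset. The name load_nested, the "json" builder, and the loader parameter are assumptions for illustration only; swap in whatever format your files actually use.

```python
from typing import Callable, Dict, Optional

# Hypothetical nested mapping: dataset name -> split name -> file path.
NestedPaths = Dict[str, Dict[str, str]]


def load_nested(paths: NestedPaths, builder: str = "json",
                loader: Optional[Callable] = None) -> dict:
    """Call the loader once per sub-dataset and return {name: loaded dataset}."""
    if loader is None:
        # Imported lazily so the helper can be defined without the library installed.
        from datasets import load_dataset
        loader = load_dataset
    # data_files maps split names to files, so each call yields one multi-split dataset.
    return {name: loader(builder, data_files=splits)
            for name, splits in paths.items()}
```

With the real loader, something like load_nested(paths)["a"]["train"] would then give the train split of sub-dataset a, assuming the files exist and match the chosen builder.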

Regarding over/undersampling, we are adding an interleave_datasets function that creates a new dataset from several datasets. By default it alternates between the original datasets, but you can specify sampling probabilities to over/undersample from them.
It will be available in the next release of the library.
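
Once interleave_datasets is available in your installed version, sampling with probabilities might look roughly like the sketch below. The wrapper name mix_datasets and the validation checks are illustrative assumptions; only the interleave_datasets call with its probabilities and seed arguments comes from the library.

```python
import math


def mix_datasets(parts, probabilities=None, seed=42):
    """Interleave several datasets, optionally sampling with given probabilities."""
    if probabilities is not None:
        # One probability per dataset, and they should sum to 1.
        if len(probabilities) != len(parts):
            raise ValueError("need exactly one probability per dataset")
        if not math.isclose(sum(probabilities), 1.0):
            raise ValueError("probabilities must sum to 1")
    # Imported lazily so the checks above work without the library installed.
    from datasets import interleave_datasets
    # With probabilities=None the result alternates between the datasets.
    return interleave_datasets(parts, probabilities=probabilities, seed=seed)
```

For example, mix_datasets([dataset1, dataset2], probabilities=[0.1, 0.9]) would draw roughly one example from dataset1 for every nine from dataset2.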

What you can do in the meantime is use a mix of shuffle and concatenate_datasets.
If you have dataset1 and dataset2 and want to get dataset3 with 10 examples from dataset1 and 90 examples from dataset2, you can do:

from datasets import concatenate_datasets

seed = 42
dataset3 = concatenate_datasets([
    dataset1.shuffle(seed=seed).select(range(10)),
    dataset2.shuffle(seed=seed).select(range(90)),
])

Thanks for the reply! Regarding shuffling, my understanding is that by default when using trainer the dataset will be shuffled and then batched. So if I don’t need sampling, I could just concat all datasets and pass it to the trainer. Is that right?

I don’t think the trainer shuffles the data, so you should probably shuffle the dataset after concatenation

EDIT: actually it does, see thom’s comment

But the official examples don’t seem to mention shuffling explicitly. If the trainer did not shuffle by default, wouldn’t that be a problem (e.g., there is no “shuffl*” in this example)?

Yes, the HF trainer shuffles the examples (see here for instance: transformers/trainer.py at 4605b2b8ec5512a5ea125773bcaa4b0014b32d50 · huggingface/transformers · GitHub).
