Does datasets support nested datasets? I have a nested dataset path dict, e.g.,
{
    "a": {
        "train": ...,
        "dev": ...,
        "test": ...,
    },
    "b": {
        "train": ...,
        "dev": ...,
        "test": ...,
    },
}
and load_dataset seems unhappy with this structure (TypeError: expected str, bytes or os.PathLike object, not dict). Does that imply I have to call load_dataset for each sub-dataset and then merge them together?
Further, is there any support for over/undersampling given multiple (sub-)datasets? Thanks a lot!
Hi!
Indeed, load_dataset is used to load only one dataset. This dataset can have several splits though. In your case you have to call load_dataset once per dataset.
Regarding over/undersampling, we are adding an interleave_datasets function that creates a new dataset from several datasets. By default it alternates between the original datasets, but you can specify sampling probabilities to over/undersample from them. It will be available in the next release of the library.
What you can do in the meantime is use a mix of shuffle and concatenate_datasets.
If you have dataset1 and dataset2 and want to get dataset3 with 10 examples from dataset1 and 90 examples from dataset2, you can do:
from datasets import concatenate_datasets

seed = 42
# Shuffle each dataset, take the first 10 / 90 examples, then concatenate.
dataset3 = concatenate_datasets([
    dataset1.shuffle(seed=seed).select(range(10)),
    dataset2.shuffle(seed=seed).select(range(90)),
])
Thanks for the reply! Regarding shuffling, my understanding is that by default, when using the Trainer, the dataset will be shuffled and then batched. So if I don’t need sampling, I could just concatenate all the datasets and pass the result to the Trainer. Is that right?
I don’t think the Trainer shuffles the data, so you should probably shuffle the dataset after concatenation.
EDIT: actually it does, see thom’s comment.
But the official examples don’t seem to mention shuffling explicitly; if the Trainer didn’t shuffle by default, wouldn’t those examples have issues (e.g., there is no “shuffl*” in this example)?