Can I interleave datasets up to the length of the longest one?

I usually build composed datasets with "interleave_datasets". It is very useful and powerful for assembling a large dataset from the Hugging Face Hub. But I have a question: is it possible to interleave a group of datasets up to the length of the longest dataset, rather than the shortest?

Here is an example:

from datasets import interleave_datasets, Dataset

d1 = Dataset.from_dict({"a": [1,2,3,4] })
d2 = Dataset.from_dict({"a": [100,200] })
interleaved = interleave_datasets([d1,d2])
print(len(interleaved))
>>> 4
# The length is 4 because interleaving stops once the shorter dataset (d2) runs out,
# so the dataset is constructed as {'a': 1}, {'a': 100}, {'a': 2}, {'a': 200}.

But I want to build a dataset like this:

{'a': 1}, {'a': 100}, {'a': 2}, {'a': 200}, {'a': 3}, {'a': 100}, {'a': 4}, {'a': 200}
# The length of this dataset is 8, because the shorter dataset is cycled until the longer one is used up.

How can I do this with the Hugging Face datasets library?

Hi! We have an open PR that adds this feature: "Add oversampling strategies to interleave datasets" by ylacombe (Pull Request #4831, huggingface/datasets on GitHub). Feel free to comment on it to suggest improvements, etc.
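In the meantime, here is a minimal plain-Python sketch of the ordering asked for in the question: round-robin over the inputs, cycling the shorter ones until the longest is exhausted. The helper name `interleave_longest` is hypothetical and is not part of the datasets library. It works on ordinary lists of rows rather than on `Dataset` objects, so treat it as an illustration of the index pattern, not as the PR's implementation.

```python
from itertools import cycle, islice


def interleave_longest(*sequences):
    """Round-robin over sequences, cycling shorter ones until the longest is used up.

    Hypothetical helper for illustration; not a datasets library function.
    """
    # Nothing to interleave, or an empty input that cannot be cycled.
    if not sequences or any(len(s) == 0 for s in sequences):
        return []
    longest = max(len(s) for s in sequences)
    # Repeat each sequence cyclically and cut it to the longest length.
    padded = [list(islice(cycle(s), longest)) for s in sequences]
    # zip(*padded) yields one row per step; flatten the rows in order.
    return [item for row in zip(*padded) for item in row]


d1 = [{"a": v} for v in [1, 2, 3, 4]]
d2 = [{"a": v} for v in [100, 200]]
interleaved = interleave_longest(d1, d2)
print(len(interleaved))
print(interleaved)
```

If your installed version of datasets already includes the oversampling feature, calling `interleave_datasets([d1, d2], stopping_strategy="all_exhausted")` should give the same ordering directly on `Dataset` objects. Check the documentation for your version before relying on that parameter.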
