Can I make the interleave dataset for the longest one

psyche · August 2, 2022, 11:34am

I have been building combined datasets with “interleave_datasets”. It is a very useful and powerful way to assemble a large dataset from the Hugging Face Hub. But I have a question: is it possible to interleave a group of datasets up to the length of the longest dataset rather than the shortest?

Here is an example:

from datasets import interleave_datasets, Dataset

d1 = Dataset.from_dict({"a": [1, 2, 3, 4]})
d2 = Dataset.from_dict({"a": [100, 200]})
# Interleave the two datasets with the default settings.
interleaved = interleave_datasets([d1, d2])
print(len(interleaved))
# prints 4, because the dataset is built as {'a': 1}, {'a': 100}, {'a': 2}, {'a': 200}

But I want to build a dataset like this:

{'a': 1}, {'a': 100}, {'a': 2}, {'a': 200}, {'a': 3}, {'a': 100}, {'a': 4}, {'a': 200}
# the length of the dataset is 8 (the shorter dataset is cycled)

How can I do this with the Hugging Face datasets library?

mariosasko · August 12, 2022, 12:51pm

Hi! We have an open PR that adds this feature: "Add oversampling strategies to interleave datasets" by ylacombe (Pull Request #4831 in huggingface/datasets on GitHub). Feel free to comment on it to suggest improvements.
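
In the meantime, the cycling behavior asked about above can be sketched in plain Python. This is only an illustration, not the PR's implementation: `interleave_longest` is a hypothetical helper name, and it assumes the rows fit in memory as lists of dicts (for example, obtained by iterating over each dataset).

```python
from itertools import cycle, islice


def interleave_longest(*row_lists):
    """Round-robin over lists of rows, cycling shorter lists until the longest is used up.

    Hypothetical helper for illustration; not part of the datasets library.
    """
    if not row_lists or any(len(rows) == 0 for rows in row_lists):
        raise ValueError("need at least one non-empty list of rows")
    longest = max(len(rows) for rows in row_lists)
    # cycle() restarts a short list from its beginning;
    # islice() caps every source at `longest` rows.
    padded = [islice(cycle(rows), longest) for rows in row_lists]
    # zip() takes one row from each source per round; flatten the rounds.
    return [row for round_ in zip(*padded) for row in round_]


d1_rows = [{"a": v} for v in [1, 2, 3, 4]]
d2_rows = [{"a": v} for v in [100, 200]]

interleaved = interleave_longest(d1_rows, d2_rows)
print(len(interleaved))  # 8
print([row["a"] for row in interleaved])  # [1, 100, 2, 200, 3, 100, 4, 200]
```

If I recall correctly, later releases of the library expose this kind of oversampling through a `stopping_strategy="all_exhausted"` argument to `interleave_datasets`, but please check the documentation for your installed version before relying on that.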
