A couple of questions about interleave_datasets()

I would like to fuse several datasets of parallel texts into one, because one is not enough to adequately train the model

I’m using interleave_datasets but I feel like I’m doing something wrong

  1. First, I don't understand why merging two datasets produces such an unpredictable size. The result is usually several times larger than its components, although I would expect it to be their sum:
from datasets import load_dataset, interleave_datasets

d1 = load_dataset('opus100', 'de-en')  # num_rows: 1,000,000
d2 = load_dataset('ted_talks_iwslt', language_pair=("de", "en"), year="2016")  # num_rows: 3,662

dataset = interleave_datasets([d1['train'], d2['train']], stopping_strategy="all_exhausted")

dataset  # num_rows: 2,000,000 (~2 times their sum)

#-------

d1 = load_dataset('opus100', 'de-ru')  # num_rows: 2,000
d2 = load_dataset('ted_talks_iwslt', language_pair=("de", "ru"), year="2016")  # num_rows: 3,650
d3 = load_dataset('news_commentary', 'de-ru')  # num_rows: 113,117

dataset = interleave_datasets([d1['train'], d2['train'], d3['train']], stopping_strategy="all_exhausted")

dataset  # num_rows: 527,715 (~4.5 times their sum)
  2. The documentation says:

probabilities (List[float], optional, defaults to None) — If specified, the new dataset is constructed by sampling examples from one source at a time according to these probabilities.

I don't really understand this. All I take away from it is that the parameter can be randomized endlessly until the model gets better. Can someone explain it more clearly? And it would be great if you could tell me how best to use it for a translation model.

  3. Perhaps it would be better not to merge all the datasets into one, but simply to train the model on one dataset first, then train its last checkpoint on the next one, and so on?

Sorry for the long text, and thanks to everyone who answers at least one question!

Hi! If you don't specify probabilities, it will alternate between the datasets, taking one example at a time from each.

With the "all_exhausted" strategy, it loops multiple times over the smaller datasets and exactly once over the biggest one.
So if you have two datasets and the biggest has 1M rows, the resulting dataset alternates between the two until the 1M rows of the big dataset are exhausted (looping multiple times over the small dataset).
Therefore you end up with 2M rows.
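
To make that concrete, here is a minimal sketch with two toy datasets (the data is made up for illustration):

from datasets import Dataset, interleave_datasets

# Toy datasets: the "big" one has 4 rows, the "small" one has 2 rows.
big = Dataset.from_dict({"a": [0, 1, 2, 3]})
small = Dataset.from_dict({"a": [10, 11]})

# Without probabilities the sources are alternated one example at a time;
# with "all_exhausted" the small dataset is looped over until the big one
# is fully consumed.
mix = interleave_datasets([big, small], stopping_strategy="all_exhausted")
print(mix["a"])  # [0, 10, 1, 11, 2, 10, 3, 11] -> 8 rows = 2 x len(big)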

So is it good or bad that it will go through the small datasets several times? Maybe it's better not to mix datasets that differ in size by a factor of 100 at all? Or maybe skip interleave_datasets and simply train the model on each dataset in turn?

Thanks

So is it good or bad that it will go through the small datasets several times?

It depends on what you'd like to achieve. Sometimes it's good to oversample quality datasets (e.g. it's common to repeat Wikipedia when pretraining LLMs, for knowledge and text quality). But sometimes you don't want your model to overfit on certain data, which can happen if you repeat it a lot.

If you want more control over how many times a dataset is looped over, you can specify the probabilities of sampling from one dataset or the other in interleave_datasets.
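
For example, reusing d1 and d2 from the first snippet (the 0.9/0.1 split below is just an illustrative choice, not a recommendation):

from datasets import interleave_datasets

# Draw ~90% of the examples from the big dataset and ~10% from the small one.
# With probabilities set, sources are picked at random according to these
# weights, so the seed controls the exact mixture.
dataset = interleave_datasets(
    [d1['train'], d2['train']],
    probabilities=[0.9, 0.1],
    seed=42,
    stopping_strategy="all_exhausted",
)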

I understand that everything is very situational, but what do you think would be best for training a translator model?

And what I really want to understand is the difference between using interleave_datasets([d1, d2, d3, d4, d5]) and training on all the datasets one by one: how will this potentially affect the quality of the translation model? It seems to me that for a translator it doesn't matter much how many times it sees the same translation.

Thank you for your time

I understand that everything is very situational, but what do you think would be best for training a translator model?

I think you’ll have to do some experiments to find out, or refer to the literature. I’m not an expert in translation myself, sorry

And what I really want to understand is the difference between using interleave_datasets([d1, d2, d3, d4, d5]) and training on all the datasets one by one: how will this potentially affect the quality of the translation model?

Good question. Personally I've only seen experiment reports using interleaved datasets, not consecutive trainings, so I'd rather do the same to stay close to research setups that have already been studied and proven successful.

Hi @lhoestq, what would you recommend for the following use case?

from datasets import Dataset, interleave_datasets

d1 = Dataset.from_dict({"a": [0, 1, 2]})
d2 = Dataset.from_dict({"a": [10]})
d3 = Dataset.from_dict({"a": [20, 21]})
dataset = interleave_datasets([d1, d2, d3], seed=42, stopping_strategy="all_exhausted")
print(dataset['a'])
# Output: [0, 10, 20, 1, 10, 21, 2, 10, 20]

Required behaviour (a hypothetical "all_unique_exhausted" strategy):
dataset = interleave_datasets([d1, d2, d3], seed=42, stopping_strategy="all_unique_exhausted")
print(dataset['a'])
# Expected output: [0, 10, 20, 1, 21, 2]
# Basically I want to use this for shuffling among multilingual datasets for LLM training,
# where the data length distribution varies a lot.

Could you recommend an approach to achieve the expected behaviour?

This strategy would always yield examples from the same dataset at the end (once the others are exhausted), which may not be ideal if you want to train on multilingual datasets and have all the languages potentially present at every step of your training.

Have you considered passing probabilities= to interleave_datasets, e.g. to oversample the languages with more data?
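
For instance, a common approach (just a sketch, not an official recipe) is to derive the sampling probabilities from the dataset sizes, optionally smoothed with a temperature exponent so the smaller languages are not drowned out:

from datasets import Dataset, interleave_datasets

# Toy stand-ins for the per-language datasets (same as the example above).
d1 = Dataset.from_dict({"a": [0, 1, 2]})
d2 = Dataset.from_dict({"a": [10]})
d3 = Dataset.from_dict({"a": [20, 21]})

parts = [d1, d2, d3]
sizes = [len(d) for d in parts]

# alpha = 1.0 samples proportionally to dataset size; alpha < 1.0 (e.g. 0.5)
# upsamples the smaller datasets. The value here is an illustrative choice.
alpha = 0.5
weights = [s ** alpha for s in sizes]
probabilities = [w / sum(weights) for w in weights]

dataset = interleave_datasets(
    parts,
    probabilities=probabilities,
    seed=42,
    stopping_strategy="all_exhausted",
)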