A couple of questions about interleave_datasets()

I would like to fuse several datasets of parallel texts into one, because one is not enough to adequately train the model

I’m using interleave_datasets but I feel like I’m doing something wrong

  1. First, I don't understand why merging two datasets produces such an unpredictable size. The result is usually several times larger than its components, although I would expect it to be their sum:
from datasets import load_dataset, interleave_datasets

d1 = load_dataset('opus100', 'de-en')  # num_rows: 1,000,000
d2 = load_dataset('ted_talks_iwslt', language_pair=("de", "en"), year="2016")  # num_rows: 3,662

dataset = interleave_datasets([d1['train'], d2['train']], stopping_strategy="all_exhausted")

dataset  # num_rows: 2,000,000 (~2 times their sum)

#-------

d1 = load_dataset('opus100', 'de-ru')  # num_rows: 2,000
d2 = load_dataset('ted_talks_iwslt', language_pair=("de", "ru"), year="2016")  # num_rows: 3,650
d3 = load_dataset('news_commentary', 'de-ru')  # num_rows: 113,117

dataset = interleave_datasets([d1['train'], d2['train'], d3['train']], stopping_strategy="all_exhausted")

dataset  # num_rows: 527,715 (~4.5 times their sum)
  2. The documentation says:

probabilities (List[float], optional, defaults to None) — If specified, the new dataset is constructed by sampling examples from one source at a time according to these probabilities.

I don't really understand this. All I take away from it is that the parameter can be randomized endlessly until the model gets better. Can someone explain it more clearly? And it would be great if you could tell me how best to use it for a translation model.

  3. Perhaps it would be better not to merge all the datasets into one, but simply to train the model on one dataset first, then train its last checkpoint on the next one, and so on?

Sorry for the long text, and thanks to everyone who answers at least one question!

Hi! If you don't specify probabilities, it will alternate between the datasets, taking one example at a time from each.

With the "all_exhausted" strategy, it loops multiple times over the smaller datasets and exactly once over the biggest one.
So if you have two datasets and the biggest has 1M rows, the resulting dataset alternates between the two until the 1M rows of the big dataset are exhausted (looping multiple times over the small dataset).
Therefore you end up with 2M rows.
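
To make that concrete, here is a minimal sketch with two toy datasets (the data is made up for illustration):

from datasets import Dataset, interleave_datasets

# Toy datasets: the "big" one has 4 rows, the "small" one has 2 rows.
big = Dataset.from_dict({"a": [0, 1, 2, 3]})
small = Dataset.from_dict({"a": [10, 11]})

# Without probabilities the sources are alternated one example at a time;
# with "all_exhausted" the small dataset is looped over until the big one
# is fully consumed.
mix = interleave_datasets([big, small], stopping_strategy="all_exhausted")
print(mix["a"])  # [0, 10, 1, 11, 2, 10, 3, 11] -> 8 rows = 2 x len(big)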

So is it good or bad that it will go through the small datasets several times? Maybe it's better not to mix datasets that differ in size by a factor of 100 at all? Or maybe skip interleave_datasets and simply train the model on each dataset in turn?

Thanks

So is it good or bad that it will go through the small datasets several times?

It depends on what you'd like to achieve. Sometimes it's good to oversample quality datasets (e.g. it's common to repeat Wikipedia when pretraining LLMs, for knowledge and text quality). But sometimes you don't want your model to overfit on certain data, which can happen if you repeat it a lot.

If you want more control over how many times a dataset is looped over, you can specify the probabilities of sampling from one dataset or the other in interleave_datasets.
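
For example, reusing d1 and d2 from the first snippet (the 0.9/0.1 split below is just an illustrative choice, not a recommendation):

from datasets import interleave_datasets

# Draw ~90% of the examples from the big dataset and ~10% from the small one.
# With probabilities set, sources are picked at random according to these
# weights, so the seed controls the exact mixture.
dataset = interleave_datasets(
    [d1['train'], d2['train']],
    probabilities=[0.9, 0.1],
    seed=42,
    stopping_strategy="all_exhausted",
)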

I understand that everything is very situational, but what do you think would be best for training a translator model?

And what I really want to understand is the difference between using interleave_datasets([d1, d2, d3, d4, d5]) and training on all the datasets one by one: how will this potentially affect the quality of the translation model? It seems to me that for a translator it doesn't matter much how many times it sees the same translation.

Thank you for your time

I understand that everything is very situational, but what do you think would be best for training a translator model?

I think you’ll have to do some experiments to find out, or refer to the literature. I’m not an expert in translation myself, sorry

And what I really want to understand is the difference between using interleave_datasets([d1, d2, d3, d4, d5]) and training on all the datasets one by one: how will this potentially affect the quality of the translation model?

Good question. Personally I've only seen experiment reports using interleaved datasets, not consecutive trainings, so I'd rather do the same to stay close to research setups that have already been studied and proven successful.

Hi @lhoestq, what would you recommend for the following use case?

from datasets import Dataset, interleave_datasets

d1 = Dataset.from_dict({"a": [0, 1, 2]})
d2 = Dataset.from_dict({"a": [10]})
d3 = Dataset.from_dict({"a": [20, 21]})
dataset = interleave_datasets([d1, d2, d3], seed=42, stopping_strategy="all_exhausted")
print(dataset['a'])
# Output: [0, 10, 20, 1, 10, 21, 2, 10, 20]

Required behaviour (a hypothetical "all_unique_exhausted" strategy):
dataset = interleave_datasets([d1, d2, d3], seed=42, stopping_strategy="all_unique_exhausted")
print(dataset['a'])
# Expected output: [0, 10, 20, 1, 21, 2]
# Basically I want to use this for shuffling among multilingual datasets for LLM training,
# where the data length distribution varies a lot.

Could you recommend an approach to achieve the expected behaviour?

This strategy would always yield examples from the same dataset at the end (once the others are exhausted), which may not be ideal if you want to train on multilingual datasets and have all the languages potentially present at every step of your training.

Have you considered passing probabilities= to interleave_datasets, e.g. to oversample the languages with more data?
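
For instance, a common approach (just a sketch, not an official recipe) is to derive the sampling probabilities from the dataset sizes, optionally smoothed with a temperature exponent so the smaller languages are not drowned out:

from datasets import Dataset, interleave_datasets

# Toy stand-ins for the per-language datasets (same as the example above).
d1 = Dataset.from_dict({"a": [0, 1, 2]})
d2 = Dataset.from_dict({"a": [10]})
d3 = Dataset.from_dict({"a": [20, 21]})

parts = [d1, d2, d3]
sizes = [len(d) for d in parts]

# alpha = 1.0 samples proportionally to dataset size; alpha < 1.0 (e.g. 0.5)
# upsamples the smaller datasets. The value here is an illustrative choice.
alpha = 0.5
weights = [s ** alpha for s in sizes]
probabilities = [w / sum(weights) for w in weights]

dataset = interleave_datasets(
    parts,
    probabilities=probabilities,
    seed=42,
    stopping_strategy="all_exhausted",
)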