I would like to merge several parallel-text datasets into one, because a single dataset is not enough to adequately train my model.
I'm using interleave_datasets, but I feel like I'm doing something wrong.
First, I don't understand why merging two datasets produces such an unexpected size. The result is usually several times larger than its components, although I would expect it to be their sum.
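Roughly what I'm doing looks like this (the dataset names below are just placeholders for the parallel corpora I actually use):

```python
# Placeholder example: "opus_books" / "opus100" stand in for my actual parallel-text datasets.
from datasets import load_dataset, interleave_datasets

ds_a = load_dataset("opus_books", "en-fr", split="train")
ds_b = load_dataset("opus100", "en-fr", split="train")

merged = interleave_datasets([ds_a, ds_b], stopping_strategy="all_exhausted")

# merged ends up much bigger than len(ds_a) + len(ds_b), which surprised me
print(len(ds_a), len(ds_b), len(merged))
```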
probabilities (List[float], optional, defaults to None) — If specified, the new dataset is constructed by sampling examples from one source at a time according to these probabilities.
I don't really understand this; all I get from it is that the probabilities can be tweaked endlessly until the model gets better. Can someone explain it more clearly? It would also be great if you could tell me how best to use this for a translation model.
Or perhaps it would be better not to merge all the datasets into one at all, but instead to fine-tune the model on one dataset, then train the last checkpoint on the next one, and so on?
Sorry for the long post, and thanks to everyone who answers even one of these questions!
Hi! If you don't specify probabilities, it will alternate between the datasets, one example at a time.
With the "all_exhausted" strategy, it will loop multiple times through the smaller datasets and exactly once through the biggest dataset.
So if you have two datasets and the bigger one has 1M rows, the resulting dataset will alternate between the two until the 1M rows of the big dataset are exhausted (looping multiple times over the small dataset).
Therefore you end up with 2M rows.
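For example, with two tiny in-memory datasets (just a toy illustration of the arithmetic, not data from this thread):

```python
from datasets import Dataset, interleave_datasets

big = Dataset.from_dict({"text": [f"big-{i}" for i in range(6)]})      # plays the 1M-row dataset
small = Dataset.from_dict({"text": [f"small-{i}" for i in range(2)]})  # plays the small dataset

merged = interleave_datasets([big, small], stopping_strategy="all_exhausted")

print(len(merged))     # 12 == 2 * len(big): alternates until `big` is exhausted
print(merged["text"])  # ['big-0', 'small-0', 'big-1', 'small-1', 'big-2', 'small-0', ...]
```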
So is it good or bad that it goes through the small datasets several times? Maybe it's better not to mix datasets that differ in size by a factor of 100 at all? Or maybe not to use interleave_datasets, but simply train the model on each dataset in turn?
So is it good or bad that it goes through the small datasets several times?
It depends on what you'd like to achieve. Sometimes it's good to oversample high-quality datasets (e.g. it's common to repeat Wikipedia when pretraining LLMs, for knowledge and text quality). But sometimes you don't want your model to overfit on certain data, which can happen if you repeat it a lot.
If you want more control over how many times a dataset is looped over, you can specify the probabilities of sampling from each dataset in interleave_datasets.
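Here is a minimal sketch with toy in-memory data (not from this thread) showing how the `probabilities` weights control how often each source is drawn from, and therefore how many times the small dataset gets repeated:

```python
from datasets import Dataset, interleave_datasets

big = Dataset.from_dict({"text": [f"big-{i}" for i in range(1000)]})
small = Dataset.from_dict({"text": [f"small-{i}" for i in range(100)]})

merged = interleave_datasets(
    [big, small],
    probabilities=[0.8, 0.2],           # ~80% of draws come from `big`
    seed=42,                            # reproducible sampling
    stopping_strategy="all_exhausted",  # keep looping sources until every dataset has been fully seen
)

# Lowering the weight on `small` means it is revisited fewer times before
# `big` runs out; raising it has the opposite effect.
print(len(merged))
```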
I understand that this is all very situational, but what do you think would work best for training a translation model?
And what I really want to understand is the difference between the two approaches: using interleave_datasets(d1, d2, d3, d4, d5) versus training on the datasets one after another. How could this affect the quality of the translation model? It seems to me that for a translator it doesn't matter much how many times it sees the same translation pair.
I understand that this is all very situational, but what do you think would work best for training a translation model?
I think you'll have to do some experiments to find out, or refer to the literature. I'm not an expert in translation myself, sorry.
And what I really want to understand is the difference between the two approaches: using interleave_datasets(d1, d2, d3, d4, d5) versus training on the datasets one after another. How could this affect the quality of the translation model?
Good question. Personally I've only seen experiment reports that use interleaved datasets, not consecutive training runs, so I'd rather do the same to stay close to research setups that have already been studied and proven successful.