How to preprocess dataset with multiple references

Hi, beginner here :wave:

I want to fine-tune a transformer (mT5) on a dataset in the :hugs: datasets library (TaTA) which has multiple references for some examples. I want to treat each one of these references as an individual training sample. How can I do this with the datasets library, or would I be better off just converting the dataset to a pandas dataframe and processing it that way?


This dataset is small, so processing it in Pandas sounds like a good idea. Then, you can simply convert it back into a Dataset (with Dataset.from_pandas).

Hi, I also have a similar kind of doubt. Suppose I want to fine-tune a model for the text generation task using multiple datasets
The first dataset is something like this: Open-Orca/OpenOrca
The second dataset is something like this: GAIR/lima
I want to concatenate both datasets and use them as large datasets.
So, how do you preprocess the data in this type of scenario?

Thanks @mariosasko

Hi @Pranavagrl, I’d say this is a separate issue. I haven’t looked at the two datasets but I don’t think concatenating them will work unless the inputs and references are in the exact same format across both datasets. You should probably train on one, then the other, but this isn’t something I’m very familiar with.

Okay, No problem
Thanks for your support.