I am trying to train an encoder-decoder model on the empathetic_dialogues dataset from Hugging Face.
The dialogues are formatted as follows:
Here, conv_id identifies a unique conversation, and speaker_idx distinguishes the speaker from the listener. I would like to group the utterances as follows:
For utterance index 1: input is … utterance1 …
For utterance index 2: input is … utterance1 … … utterance2 … … utterance2 …
and so on.
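To make the grouping concrete, here is a minimal sketch in plain Python of one possible reading of it: for each conversation, the input is every utterance so far and the target is the next utterance. It assumes each row is a dict with the columns `conv_id`, `utterance_idx` and `utterance` (the last two names are my assumption about the dataset's schema), and the function name `build_context_pairs`, the `sep` argument, and the sample rows are made up for illustration.

```python
from collections import defaultdict


def build_context_pairs(rows, sep=" "):
    """Group rows by conv_id and emit cumulative (context, response) pairs.

    Assumes each row has 'conv_id', 'utterance_idx' and 'utterance' keys.
    """
    # Collect all turns belonging to the same conversation.
    conversations = defaultdict(list)
    for row in rows:
        conversations[row["conv_id"]].append(row)

    pairs = []
    for conv_id, turns in conversations.items():
        # Restore the turn order inside each conversation.
        turns.sort(key=lambda r: r["utterance_idx"])
        # The context grows by one utterance per step; the target is the next turn.
        for i in range(1, len(turns)):
            context = sep.join(t["utterance"] for t in turns[:i])
            pairs.append(
                {
                    "conv_id": conv_id,
                    "context": context,
                    "response": turns[i]["utterance"],
                }
            )
    return pairs


# Hypothetical toy rows, not taken from the real dataset.
rows = [
    {"conv_id": "c1", "utterance_idx": 1, "utterance": "hi"},
    {"conv_id": "c1", "utterance_idx": 2, "utterance": "hello"},
    {"conv_id": "c2", "utterance_idx": 1, "utterance": "bad day"},
    {"conv_id": "c1", "utterance_idx": 3, "utterance": "how are you"},
    {"conv_id": "c2", "utterance_idx": 2, "utterance": "sorry to hear"},
]

pairs = build_context_pairs(rows)
for pair in pairs:
    print(pair)
```

This works on any iterable of dicts and does not use any datasets-specific API, so it only shows the shape of the output I am after, not how to do it efficiently inside the library.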
Is there a way to achieve this in Hugging Face datasets without converting it to a DataFrame and back? A follow-up question: what is the general pipeline used in industry for training such a multi-turn dialogue agent?
Thanks in advance for the help. This is my first question on the forum. If I have made any mistakes, please let me know and I will correct them quickly.