How to group sentences in a dataset for multi-turn dialogue conversation?

samarendra109 · July 26, 2023, 1:09am

I am trying to train an encoder-decoder model on the empathetic_dialogues dataset from the Hugging Face Hub.

The dialogue is formatted as follows,

Here, conv_id identifies a unique conversation, and speaker_idx distinguishes the speaker from the listener. I would like to group the utterances as follows:

For utterance index 1: input is … utterance1 …
For utterance index 2: input is … utterance1 … … utterance2 … … utterance2 …
and so on.
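To make the target concrete, here is a minimal plain-Python sketch of the grouping I have in mind. It assumes that each row carries the columns conv_id, utterance_idx, speaker_idx and utterance, and that the input for a given utterance is the cumulative history of its conversation up to and including that utterance. The helper name build_histories, the " | " separator and the sample rows are all made up for illustration; the real separator would be whatever special token the model expects.

```python
from collections import defaultdict


def build_histories(rows, sep=" | "):
    """Return one example per utterance, holding the cumulative history
    of its conversation up to and including that utterance.

    `sep` is a hypothetical placeholder for the real separator token.
    """
    # Bucket the rows by conversation id.
    by_conv = defaultdict(list)
    for row in rows:
        by_conv[row["conv_id"]].append(row)

    examples = []
    for conv_id, turns in by_conv.items():
        # Make sure the turns are in conversation order.
        turns = sorted(turns, key=lambda r: r["utterance_idx"])
        history = []
        for turn in turns:
            history.append(turn["utterance"])
            examples.append(
                {
                    "conv_id": conv_id,
                    "utterance_idx": turn["utterance_idx"],
                    "speaker_idx": turn["speaker_idx"],
                    "input": sep.join(history),
                }
            )
    return examples


# Made-up rows for illustration only; not taken from the dataset.
rows = [
    {"conv_id": "c1", "utterance_idx": 1, "speaker_idx": 0, "utterance": "utterance1"},
    {"conv_id": "c1", "utterance_idx": 2, "speaker_idx": 1, "utterance": "utterance2"},
    {"conv_id": "c1", "utterance_idx": 3, "speaker_idx": 0, "utterance": "utterance3"},
    {"conv_id": "c2", "utterance_idx": 1, "speaker_idx": 5, "utterance": "hello"},
]

examples = build_histories(rows)
for example in examples:
    print(example["conv_id"], example["utterance_idx"], repr(example["input"]))
```

This only works on a list of dicts held in memory, which is exactly the kind of detour I would like to avoid.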

Is there a way to achieve this in Hugging Face datasets without converting to a dataframe and back? As a follow-up question, what is the general pipeline used in industry for training such a multi-turn dialogue agent?

Thanks in advance for the help. This is my first question on the forum. If I have made any mistakes, please let me know and I will correct them quickly.
