Dataset for training BlenderBot

I am trying to build a chatbot using BlenderbotForConditionalGeneration. I am starting from the pretrained model, but I need to fine-tune it. My question is: what should the training data look like, and are there any tutorials on how to preprocess it in order to fine-tune the model?

Thank you! :hugs:

I also want to fine-tune BlenderBot with custom data. I noticed the blended_skill_talk dataset page says the BlenderBot weights were trained on it, so it might make sense to structure our own custom training data like the blended_skill_talk dataset. But I'm not sure, because when I tried to train on the blended_skill_talk dataset, like this:

from transformers import BlenderbotForConditionalGeneration, Trainer, TrainingArguments
from datasets import load_dataset

mname = 'facebook/blenderbot-400M-distill'
model = BlenderbotForConditionalGeneration.from_pretrained(mname).to('cuda')
train_dataset = load_dataset("blended_skill_talk", split="train")
val_dataset = load_dataset("blended_skill_talk", split="validation")
training_args = TrainingArguments("tmp_trainer")
trainer = Trainer(model=model, args=training_args,
                  train_dataset=train_dataset, eval_dataset=val_dataset)
trainer.train()

I get an error (ValueError: too many dimensions 'str'). I think it has to do with the collator: the raw dataset still contains string columns, and the default collator can't batch strings into tensors, but I'm not sure of the right way to fix this.
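My current guess is that each example needs to be tokenized into input_ids and labels before being handed to Trainer, so the collator only sees integer tensors. Here is a minimal sketch of what I mean, continuing from the snippet above; the column names previous_utterance and free_messages come from the blended_skill_talk schema, but pairing the last previous utterance with the first free message (and the max_length of 128) is just my guess at a reasonable input/target split:

from transformers import BlenderbotTokenizer

tokenizer = BlenderbotTokenizer.from_pretrained(mname)

def preprocess(example):
    # Tokenize the last utterance of the context as the model input ...
    model_inputs = tokenizer(example["previous_utterance"][-1],
                             truncation=True, max_length=128, padding="max_length")
    # ... and the first reply as the target the model should generate.
    targets = tokenizer(example["free_messages"][0],
                        truncation=True, max_length=128, padding="max_length")
    model_inputs["labels"] = targets["input_ids"]
    return model_inputs

# Drop the original string columns so only tensors reach the collator.
train_dataset = train_dataset.map(preprocess, remove_columns=train_dataset.column_names)
val_dataset = val_dataset.map(preprocess, remove_columns=val_dataset.column_names)

With this mapping the default collator no longer sees strings, though I haven't verified that this is the right way to frame the dialogue pairs, and for real training the pad tokens in labels should probably be replaced with -100 so the loss ignores them. Can anyone confirm whether this is the intended preprocessing?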
