I also want to fine-tune BlenderBot with custom data. I noticed the blended_skill_talk dataset page says the BlenderBot weights were trained on that dataset, so it might make sense to structure our own custom training data the same way. But I'm not sure, because when I tried to train on blended_skill_talk itself, like this:
from transformers import BlenderbotForConditionalGeneration, Trainer, TrainingArguments
from datasets import load_dataset
mname = 'facebook/blenderbot-400M-distill'
model = BlenderbotForConditionalGeneration.from_pretrained(mname).to('cuda')
train_dataset = load_dataset("blended_skill_talk", split="train")
val_dataset = load_dataset("blended_skill_talk", split="validation")
training_args = TrainingArguments("tmp_trainer")
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset, eval_dataset=val_dataset)
trainer.train()
I get an error (ValueError: too many dimensions 'str'). I think it has to do with the default data collator trying to turn the dataset's raw string columns into tensors, but I'm not sure how to fix this.
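My current guess is that the dataset needs to be tokenized into input_ids/labels before it reaches the Trainer. Here is a sketch of the preprocessing I have in mind; note that the column names free_messages (human turns) and guided_messages (bot replies) are my reading of the blended_skill_talk schema and should be double-checked against the dataset viewer:

```python
def flatten_episode(example):
    # Pair each human ("free") turn with the bot ("guided") reply that follows it.
    # ASSUMPTION: blended_skill_talk exposes parallel 'free_messages' and
    # 'guided_messages' list columns -- please verify against the dataset card.
    return {
        "context": list(example["free_messages"]),
        "target": list(example["guided_messages"]),
    }

def tokenize_pairs(batch, tokenizer, max_length=128):
    # 'tokenizer' would be BlenderbotTokenizer.from_pretrained(mname).
    # Converting the strings to input_ids/labels here is what should stop the
    # default collator's torch.tensor(...) call from choking on raw str columns.
    model_inputs = tokenizer(batch["context"], truncation=True, max_length=max_length)
    labels = tokenizer(batch["target"], truncation=True, max_length=max_length)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
```

I imagine you would then map both functions over the dataset (using remove_columns to drop the original string fields) and pass DataCollatorForSeq2Seq(tokenizer, model=model) as the Trainer's data_collator so padding happens per batch, but I haven't confirmed this end to end.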