How to train blended_skill_talk with transformers.trainer?

Hi,
I am trying to train BlenderBot on blended_skill_talk, but I get the error below.

My code:

...
dataset_train = load_dataset("blended_skill_talk", split="train")
dataset_validation = load_dataset("blended_skill_talk", split="validation")
dataset_test = load_dataset("blended_skill_talk", split="test")

tokenizer = BlenderbotTokenizer.from_pretrained("facebook/blenderbot-400M-distill", use_fast=True, model_max_length=512)
model = BlenderbotForConditionalGeneration.from_pretrained("facebook/blenderbot-400M-distill")
training_args = TrainingArguments("output-fb", remove_unused_columns=False)

**Not sure whether this tokenize_func is correct — the dataset stores these columns as arrays, too.**
def tokenize_func(examples):
	return tokenizer(examples["personas"], examples["context"], examples["previous_utterance"], examples["free_messages"], examples["guided_messages"], examples["suggestions"], examples["guided_chosen_suggestions"],)

dataset_train = dataset_train.map(tokenize_func, batched=True)

The error (it seems tokenization worked for the first few examples but failed partway through):

ValueError                                Traceback (most recent call last)
ValueError: [{'convai2': ["i love acting ! i'll be famous someday . what do you do ?", 'no no kids , might get some though . one day', 'that is great . i am going to a concert later', '15 and 17 , two boys sooo fun', 'they really are . and a handful at times', 'it can be sometimes . i bet being a doctor is a lot of work too .'], 'empathetic_dialogues': ['Any favorite actors?', 'One day.', 'How long must you attend school?', '4 and 5 and I have a teenager', 'They are most of the time!', "Oh. I don't know how medical school works. I am studying srt history."], 'wizard_of_wikipedia': ['I would like to develop my acting skills. What are some tips you have to not get nervous?', 'I will still wimp out. i want to be famous like the rolling stones  though.', 'good', "Close to 30! I just always have to put in a ton of work when mother's day comes around haha", 'They are actually very good with kids!', 'yeah but there are a lot of programs that help!']}, {'convai2': ['yum . i like to make lasagna and it s so good', 'yes ! trying to master lasagna .', 'it beats ramen noodles for sure ! do you have any hobbies ?', 'piercings are cool . i do not have any though .', "i don't know . whatever i want . maybe chicken", 'it would be a fashion statement . my dad would not like it .'], 'empathetic_dialogues': ['Cool. I love italian. Real italian.', "See. I'm not a great cook.", 'I love coffee, actually. I drink a few cups every morning!', 'Thats awesome i used to do the tattoos and ...

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-86-ff766f649b51> in <module>
      7 # dataset_validation = dataset_validation.map(lambda examples: tokenizer(examples["personas"], examples["context"]), batched=True)
      8 
----> 9 dataset_train = dataset_train.map(tokenize_func, batched=True)
     10 dataset_validation = dataset_validation.map(tokenize_func, batched=True)
     11 

15 frames
/usr/local/lib/python3.7/dist-packages/transformers/utils/generic.py in _missing_(cls, value)
    292     def _missing_(cls, value):
    293         raise ValueError(
--> 294             f"{value} is not a valid {cls.__name__}, please select one of {list(cls._value2member_map_.keys())}"
    295         )
    296

If the dataset (blended_skill_talk on the HuggingFace Hub) has some issue, how can we filter the invalid data out?

Hi @Honam

It fails because examples["personas"] and examples["context"] do not have the same length.
These two lists of strings need to be the same size.
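To answer the filtering part of your question: `datasets.Dataset.filter` can drop mismatched rows before tokenizing. This is just a sketch — I am assuming the mismatch is between the `free_messages` and `guided_messages` columns, so swap in whichever pair of columns is actually unequal in your run:

```python
# Sketch: keep only examples whose paired message lists have equal length.
# The column names here are an assumption; adjust them to the columns
# that actually mismatch in your data.
def is_well_formed(example):
    return len(example["free_messages"]) == len(example["guided_messages"])

# dataset_train = dataset_train.filter(is_well_formed)
```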

Hope this helps.
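One more thing worth noting: passing seven positional arguments to `tokenizer(...)` likely does not do what you expect. `__call__` only takes `text` and `text_pair` as positional text inputs, so the extra arguments probably get interpreted as options like `padding`/`truncation`, which would explain the `_missing_` enum error in the traceback. Below is a hedged sketch of an alternative `tokenize_func` that flattens each dialogue into aligned (input, target) string pairs; the column shapes are assumptions based on the dataset card, so adjust as needed:

```python
# Hedged sketch: flatten each blended_skill_talk example into aligned
# (input, target) string pairs before tokenizing. Assumed shapes:
# "personas" and "previous_utterance" are lists of strings per example,
# "free_messages" (user turns) and "guided_messages" (bot turns) are
# parallel lists of dialogue turns.
def build_tokenize_func(tokenizer, max_length=512):
    def tokenize_func(examples):
        inputs, targets = [], []
        for personas, prev, free, guided in zip(
            examples["personas"],
            examples["previous_utterance"],
            examples["free_messages"],
            examples["guided_messages"],
        ):
            # Skip malformed rows where the paired turn lists differ in length.
            if len(free) != len(guided):
                continue
            history = " ".join(personas) + " " + " ".join(prev)
            for user_msg, bot_msg in zip(free, guided):
                inputs.append(history + " " + user_msg)
                targets.append(bot_msg)
        model_inputs = tokenizer(inputs, truncation=True, max_length=max_length)
        labels = tokenizer(targets, truncation=True, max_length=max_length)
        model_inputs["labels"] = labels["input_ids"]
        return model_inputs
    return tokenize_func
```

Because each example can expand into several pairs, pass `remove_columns=dataset_train.column_names` when calling `.map(tokenize_func, batched=True)` so the output batch size is allowed to differ from the input.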