Hi, I am trying to fine-tune a T5 model for translation. The sentence pairs look fine after tokenization, but something is still wrong, because training fails with:
AssertionError: You should supply an encoding or a list of encodings to this method.
My dataset consists of pairs of English and French strings, for example:
“translate English to French: Is this realistic?”, “Est-ce réaliste?”
This is my code:
dataset = pd.read_excel('en-fr.xlsx')
checkpoint = 't5-base'

# tokenizer
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
inputs = tokenizer(list(dataset['eng']), list(dataset['fr']), padding='longest', truncation=True, return_tensors='pt')
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# model
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# arguments
training_args = TrainingArguments('trainer_args')

# trainer
trainer = Trainer(model=model, args=training_args, train_dataset=inputs, data_collator=data_collator)
trainer.train()
Thank you in advance for any advice.
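
A note on one possible cause, offered as an assumption rather than a confirmed diagnosis: Trainer reads train_dataset by integer index and expects each item to be a dict of features for one example (input_ids, attention_mask, labels), whereas the code above passes the tokenizer's batched output directly. Also, passing two lists to the tokenizer encodes them as a text/text-pair input, so the French side ends up inside input_ids instead of becoming labels. The sketch below only illustrates the per-example shape; PairDataset and the toy token ids are hypothetical and do not come from the post.

```python
# Hypothetical wrapper (not from the original post): exposes a dict of
# equal-length lists as an integer-indexable dataset of per-example dicts.
class PairDataset:
    def __init__(self, encodings, labels):
        # encodings: e.g. {"input_ids": [[...], ...], "attention_mask": [[...], ...]}
        # labels: one list of target token ids per example
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # Pick the idx-th entry of every feature and attach the labels.
        item = {key: values[idx] for key, values in self.encodings.items()}
        item["labels"] = self.labels[idx]
        return item


# Toy token ids standing in for real tokenizer output (made up for illustration).
toy_encodings = {
    "input_ids": [[13, 7, 1], [21, 1]],
    "attention_mask": [[1, 1, 1], [1, 1]],
}
toy_labels = [[42, 1], [99, 5, 1]]

train_dataset = PairDataset(toy_encodings, toy_labels)
first_example = train_dataset[0]
```

In a real run, the English and French columns would presumably be tokenized separately and unpadded, with the French token ids used as labels, and a sequence-to-sequence collator such as DataCollatorForSeq2Seq would pad inputs and labels per batch in place of DataCollatorWithPadding.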