Hi, I am trying to fine-tune a T5 model for translation. The sentence pairs look fine after tokenization, but something is still wrong, because training fails with:
AssertionError: You should supply an encoding or a list of encodings to this method.
My dataset consists of pairs of English and French strings, for example:
“translate English to French: Is this realistic?”, “Est-ce réaliste?”
This is my code:
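import pandas as pd
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from transformers import DataCollatorWithPadding, Trainer, TrainingArguments

# load the sentence pairs (columns 'eng' and 'fr')
dataset = pd.read_excel('en-fr.xlsx')
checkpoint = 't5-base'
# tokenizer: English and French lists are passed as the first and second text arguments
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
inputs = tokenizer(list(dataset['eng']), list(dataset['fr']), padding='longest', truncation=True, return_tensors='pt')
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
# model
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
# training arguments (defaults, output written to 'trainer_args')
training_args = TrainingArguments('trainer_args')
# trainer: the tokenizer output is passed directly as the training dataset
trainer = Trainer(model=model, args=training_args, train_dataset=inputs, data_collator=data_collator)
trainer.train()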
Thank you in advance for any advice.
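A note on a likely cause, offered as a hedged sketch rather than a confirmed diagnosis: Trainer fetches training examples by integer index and hands a list of them to the data collator, which expects one dict per example (with at least `input_ids`). The object returned by the tokenizer is a single batch, not a dataset of per-example dicts. In addition, `tokenizer(a, b)` treats the second list as the second half of a text pair inside the same input sequence, not as translation targets. The sketch below shows one way to wrap separately tokenized sources and targets so that each index yields a dict with `labels`. The class name `TranslationDataset` and the token ids are hypothetical placeholders, and no Transformers import is needed to run it.

```python
class TranslationDataset:
    """Map-style dataset that returns one dict per example."""

    def __init__(self, source_encodings, target_encodings):
        # Both arguments are dicts of per-example lists, i.e. the shape a
        # tokenizer returns when called WITHOUT return_tensors and padding.
        self.source = source_encodings
        self.target = target_encodings

    def __len__(self):
        return len(self.source["input_ids"])

    def __getitem__(self, idx):
        # One training example: encoder inputs plus target ids as labels.
        return {
            "input_ids": self.source["input_ids"][idx],
            "attention_mask": self.source["attention_mask"][idx],
            "labels": self.target["input_ids"][idx],
        }


# Placeholder token ids for illustration only, NOT real T5 vocabulary ids.
source = {
    "input_ids": [[10, 11, 1], [12, 1]],
    "attention_mask": [[1, 1, 1], [1, 1]],
}
target = {"input_ids": [[20, 1], [21, 22, 1]]}

train_dataset = TranslationDataset(source, target)
example = train_dataset[0]
print(len(train_dataset), sorted(example))
```

With the real tokenizer, `source` could presumably come from `tokenizer(list(dataset['eng']), truncation=True)` and `target` from `tokenizer(text_target=list(dataset['fr']), truncation=True)` in recent Transformers versions. Because inputs and labels then have different lengths per example, `DataCollatorForSeq2Seq(tokenizer, model=model)` is likely a better fit than `DataCollatorWithPadding`, since it pads the labels as well as the inputs.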