How to rewrite this code?

I saw this great blog post and tried to reproduce it: "Fine-tuning Florence-2 - Microsoft's Cutting-edge Vision Language Models".
The blog writes the training step by hand. I noticed that transformers.Trainer() only takes a tokenizer as a parameter. Is there something similar to Trainer() that can take a processor?
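
For reference, here is a minimal sketch of one possible way to do this: keep Trainer, but move the processor call into a custom collator passed through Trainer's data_collator argument. The class name ProcessorCollator and the example field names "question", "answer", and "image" are assumptions for illustration, not taken from the blog's code. Recent transformers releases also seem to accept a processing_class argument in place of tokenizer, but that may depend on the installed version.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class ProcessorCollator:
    """Hypothetical collator that turns raw examples into model inputs.

    `processor` is assumed to be something like
    AutoProcessor.from_pretrained(...), i.e. a callable that accepts
    `text=` and `images=` and exposes a `.tokenizer` attribute.
    """

    processor: Any

    def __call__(self, examples: List[Dict[str, Any]]) -> Dict[str, Any]:
        # Field names below are assumptions about the dataset layout.
        questions = [ex["question"] for ex in examples]
        answers = [ex["answer"] for ex in examples]
        images = [ex["image"] for ex in examples]

        # Encode prompts and images together with the processor.
        batch = self.processor(
            text=questions, images=images, return_tensors="pt", padding=True
        )
        # Encode the target text with the processor's tokenizer and
        # attach it as "labels", the key Trainer looks for by default.
        batch["labels"] = self.processor.tokenizer(
            text=answers,
            return_tensors="pt",
            padding=True,
            return_token_type_ids=False,
        ).input_ids
        return batch


# Usage sketch (not executed here; model, processor, and train_dataset
# are assumed to be defined elsewhere):
#
# from transformers import Trainer, TrainingArguments
#
# args = TrainingArguments(
#     output_dir="out",
#     remove_unused_columns=False,  # keep the raw image/question columns
# )
# trainer = Trainer(
#     model=model,
#     args=args,
#     train_dataset=train_dataset,
#     data_collator=ProcessorCollator(processor),
# )
# trainer.train()
```

Setting remove_unused_columns=False matters in this sketch because Trainer otherwise drops dataset columns that do not match the model's forward signature, which would remove the raw fields before the collator sees them.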