How to finetune an LLM with Image-Text pairs

I want to fine-tune THUDM/cogvlm-chat-hf to add domain knowledge. I have a dataset of characters from a cartoon show, labeled with their names, and I want to improve the model’s recognition of these characters for captioning.

Is this possible with AutoTrain?

If not, can anyone point me to a tutorial, or give me some direction? The CogVLM documentation shows how to run the fine-tuning script, but I have not found any information about the format of the dataset it expects.
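For reference, here is the kind of layout I have been assuming while waiting for an answer: a folder of images plus a JSONL metadata file, one image-caption pair per line. This is a guess at a generic format, not the actual format CogVLM's fine-tuning script requires; the file name `train.jsonl` and the keys `image` and `caption` are my own placeholders.

```python
import json
from pathlib import Path

# Hypothetical image-caption pairs; paths and keys are placeholders,
# not a format confirmed by the CogVLM docs.
records = [
    {"image": "images/char_alice_001.jpg", "caption": "Alice from the show, standing in the kitchen."},
    {"image": "images/char_bob_001.jpg", "caption": "Bob from the show, waving at the camera."},
]

# Write one JSON object per line (JSONL).
meta = Path("train.jsonl")
with meta.open("w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Reading it back gives one training example per line.
pairs = [json.loads(line) for line in meta.open(encoding="utf-8")]
print(len(pairs))  # 2
```

If someone can confirm what the actual expected schema is (single caption string vs. a conversation/turns structure), that would answer my question.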