Suggestions on ideal model architecture for sentence correction?

Hi! I am exploring sentence transformers for a visual scene detection application: correcting automated closed captioning according to what is found in the analyzed video frame. For example, if the video frame depicts a man moving his head but the automated caption states “man moving hand”, computer vision-based methods would provide context for a language model, which then corrects the caption to “man moving head”.

So the thought was to train a language model on the YouTube Caption Corrections dataset, then to somehow tokenize, or provide as context, the labels produced by the vision transformer or object detection pipeline, which analyzes the video frame at each caption timestamp and performs scene identification / object detection on it. Some of the sentence correction models out there use token masking to determine the best “fit” from a dictionary of proposed replacement words; the idea would be to populate that dictionary with context retrieved from the vision models.
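
To make the masking idea a bit more concrete, here is a rough sketch in plain Python. It rests on several assumptions: the position of the suspect word is already known, the vision labels arrive as plain strings, and `toy_scorer` with its tiny bigram table is only a stand-in for a real masked language model. All function and variable names are hypothetical, not taken from any existing library.

```python
from typing import Callable, Dict, Sequence, Tuple

MASK = "[MASK]"

# Hypothetical interface: scorer(masked_sentence, candidate) -> score (higher is better).
Scorer = Callable[[str, str], float]

# Toy stand-in for language model knowledge: counts of (previous word, word) pairs.
TOY_BIGRAM_COUNTS: Dict[Tuple[str, str], float] = {
    ("man", "moving"): 3.0,
    ("moving", "head"): 2.0,
    ("moving", "hand"): 2.0,
}


def toy_scorer(masked_sentence: str, candidate: str) -> float:
    """Score a candidate by how often it follows the word before the mask."""
    words = masked_sentence.split()
    i = words.index(MASK)
    previous = words[i - 1] if i > 0 else "<s>"
    return TOY_BIGRAM_COUNTS.get((previous, candidate), 0.0)


def fill_from_vision_labels(
    caption: str,
    position: int,
    vision_labels: Sequence[str],
    scorer: Scorer,
) -> str:
    """Mask the word at `position` and refill it with the best-scoring vision label."""
    words = caption.split()
    if not vision_labels:
        return caption  # no visual context, leave the caption unchanged
    masked_sentence = " ".join(words[:position] + [MASK] + words[position + 1:])
    # The replacement dictionary contains only labels found in the frame.
    best = max(vision_labels, key=lambda label: scorer(masked_sentence, label))
    return " ".join(words[:position] + [best] + words[position + 1:])


# Labels assumed to come from the object detector for this frame.
print(fill_from_vision_labels("man moving hand", 2, ["man", "head"], toy_scorer))
# prints "man moving head"
```

In this sketch the language model only ranks words that the vision side actually reported, so “hand” cannot win unless the detector saw a hand. Deciding which position to mask, and how to weigh detector confidence against the language model score, is left open here.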

Any ideas on what the ideal model architecture for this would be, from a language model perspective?