Suggestions on ideal model architecture for sentence correction?

jrandel · September 6, 2021, 11:05pm

Hi! I am exploring sentence transformers for a visual scene detection application: correcting automated closed captions according to what is found in the analyzed video frame. For example, if the video frame depicts a man moving his head but the automated caption states “man moving hand”, computer vision-based methods would provide context for a language model, which then corrects the caption to “man moving head”.

So the thought was to train a language model on the YouTube Caption Corrections dataset, then to somehow tokenize, or provide as context, the labels produced by the vision transformer or object detection pipeline, which analyzes the video frame at each caption's timestamp and performs scene identification / object detection on it. Some of the sentence correction models out there use token masking to determine the best “fit” from a dictionary of proposed replacement words; the idea would be to populate that dictionary with context retrieved from the vision models.
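To make the masking idea concrete, here is a rough sketch of what I have in mind: mask each caption token in turn and rank replacement candidates drawn only from the vision labels. Everything in it is an assumption for illustration. `correct_caption`, `toy_scorer`, and the `threshold` value are hypothetical names and settings, and the scorer is a plain string-similarity placeholder from the standard library, not a language model.

```python
from difflib import SequenceMatcher
from typing import Callable, Iterable, Sequence

MASK = "[MASK]"

# Scorer signature: (masked_tokens, mask_index, candidate) -> score (higher is better).
Scorer = Callable[[Sequence[str], int, str], float]


def toy_scorer(original_tokens: Sequence[str]) -> Scorer:
    """Hypothetical stand-in for a masked LM.

    It ignores the masked context and simply prefers candidates that are
    spelled like the original word. A real system would return the model's
    probability of `candidate` at the masked position instead.
    """
    def score(masked_tokens: Sequence[str], mask_index: int, candidate: str) -> float:
        return SequenceMatcher(None, original_tokens[mask_index], candidate).ratio()
    return score


def correct_caption(
    caption: str,
    vision_labels: Iterable[str],
    scorer_factory: Callable[[Sequence[str]], Scorer] = toy_scorer,
    threshold: float = 0.5,  # assumed cutoff; would need tuning on real data
) -> str:
    tokens = caption.split()
    labels = {label.lower() for label in vision_labels}
    score = scorer_factory(tokens)
    corrected = list(tokens)

    for i, token in enumerate(tokens):
        if token.lower() in labels:
            continue  # word is already supported by what the frame shows
        # Mask position i and rank only the vision-derived candidates.
        masked = tokens[:i] + [MASK] + tokens[i + 1:]
        best, best_score = token, threshold
        for candidate in sorted(labels):
            s = score(masked, i, candidate)
            if s > best_score:
                best, best_score = candidate, s
        corrected[i] = best  # unchanged if no candidate beats the threshold

    return " ".join(corrected)


if __name__ == "__main__":
    print(correct_caption("man moving hand", ["man", "head"]))
```

The scorer is passed in as a factory so that a masked language model could be swapped in later without touching the loop. As far as I can tell, the `fill-mask` pipeline in 🤗 Transformers accepts a `targets` argument that restricts scoring to a given word list, which looks like a natural fit for the vision-label dictionary, but I have not tested that here.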

Any ideas on what the ideal model architecture for this would be, from a language model perspective?
