Object detection followed by feature extraction for label search

Hello, my project involves building an image search engine that finds exact or similar labels.
My idea is to use the same concept as facial recognition: detect the object (in this case, the label in the image), then extract its features to create the metadata for the database.
Is it OK to use YOLO for object detection and then a feature extractor such as VGG19 or ViT-base to create the metadata?
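
To make the search step concrete, here is a rough sketch of what I have in mind once each detected label crop has been turned into a feature vector. It only covers the lookup: the detector and the feature extractor are not shown, and the three-dimensional vectors, the label IDs, and the `search` helper are made-up placeholders (real VGG19 or ViT embeddings would have hundreds or thousands of dimensions).

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        return list(vec)
    return [x / norm for x in vec]

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))

def search(query_vec, index, top_k=3):
    """Return the top_k (label_id, score) pairs, most similar first."""
    scored = [(label_id, cosine_similarity(query_vec, vec))
              for label_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hypothetical toy "database": label ID -> embedding of the cropped label.
# In the real pipeline these would come from the feature extractor.
index = {
    "label_a": [1.0, 0.0, 0.0],
    "label_b": [0.9, 0.1, 0.0],
    "label_c": [0.0, 1.0, 0.0],
}

# Hypothetical embedding of a query crop.
results = search([0.95, 0.02, 0.0], index, top_k=2)
for label_id, score in results:
    print(label_id, round(score, 4))
```

With a large database, I assume the linear scan would be replaced by an approximate nearest-neighbour index, but the idea stays the same.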
Additionally, the labels contain text, so I could use an OCR model to extract the text and store it as metadata. My only concern is that the labels also have logos or markings that OCR will not read but that could help with the search.
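
One way I could imagine handling this is to keep both signals and blend them into one score, so the visual features cover the logos and markings that OCR misses. The sketch below is only an illustration of that idea: the token lists, the visual similarity value, the `hybrid_score` helper, and the 0.5 weight are all assumptions, not something I have tested.

```python
def jaccard(a_tokens, b_tokens):
    """Token-set overlap between two OCR outputs, from 0.0 to 1.0."""
    a, b = set(a_tokens), set(b_tokens)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def hybrid_score(text_sim, visual_sim, text_weight=0.5):
    """Weighted blend of OCR-text similarity and visual-feature similarity."""
    return text_weight * text_sim + (1.0 - text_weight) * visual_sim

# Hypothetical OCR tokens from the query label and from one database entry.
query_tokens = ["acme", "500ml"]
candidate_tokens = ["acme", "250ml"]

# Hypothetical cosine similarity between the two visual embeddings.
visual_sim = 0.9

text_sim = jaccard(query_tokens, candidate_tokens)
score = hybrid_score(text_sim, visual_sim, text_weight=0.5)
print(round(text_sim, 4), round(score, 4))
```

The weight would presumably need tuning on real examples, since labels with little or no text should lean more on the visual side.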
What do you think is the right approach, and which steps or models should I try?