Using BERT for label categorization


I’ll try to describe the task I’m trying to solve and how I’m solving it. My approach is a little bit strange, I suppose, but whatever.
The task: I’m given a set of images and, for each image, a distribution over labels. See example:
This is data from the Google Cloud Vision API: the model tells us which object is in the image (‘description’) and with what confidence (‘score’).
The business has also given me a set of categories. I need to map each image to the given categories.
My approach is: embed each label with the pretrained bert-base-uncased model and embed each given category the same way. Then I use UMAP for dimensionality reduction, cluster the labels, and map the cluster centers to the given categories using cosine similarity.
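A minimal sketch of the last step described above, mapping cluster centers to categories by cosine similarity. It is an illustration under assumptions, not the code from the notebook: the function names `cosine_similarity` and `map_clusters_to_categories` are hypothetical, the 3-dimensional vectors are made-up stand-ins for real BERT embeddings, and it uses only the standard library so it runs without dependencies (real embeddings would normally be NumPy arrays).

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def map_clusters_to_categories(cluster_centers, category_embeddings):
    """For each cluster center, return (closest category name, similarity)."""
    mapping = []
    for center in cluster_centers:
        scores = {
            name: cosine_similarity(center, embedding)
            for name, embedding in category_embeddings.items()
        }
        best = max(scores, key=scores.get)
        mapping.append((best, scores[best]))
    return mapping


# Toy stand-ins (assumptions): real vectors would come from the embedding model.
category_embeddings = {
    "animals": [1.0, 0.0, 0.0],
    "food": [0.0, 1.0, 0.0],
}
cluster_centers = [
    [0.9, 0.1, 0.0],  # closest to "animals"
    [0.2, 0.8, 0.1],  # closest to "food"
]

mapping = map_clusters_to_categories(cluster_centers, category_embeddings)
for cluster_id, (category, score) in enumerate(mapping):
    print(cluster_id, category, round(score, 3))
```

One caveat with this design: if the cluster centers are computed in the UMAP-reduced space, the category embeddings have to be projected into that same space before comparing them, otherwise the vectors have different dimensions.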
Problem: the quality is poor. The values in my similarity matrix are all very close to each other, around 0.9, which makes it hard to map each image to a cluster.
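One way to put a number on this problem is to look at the gap between the best and the second-best similarity in each row of the matrix. The sketch below is only an illustration: `top2_margins` is a hypothetical helper, and the matrix values are made up to mimic scores that all sit near 0.9, not taken from the real data.

```python
def top2_margins(similarity_matrix):
    """For each row, return the best score minus the second-best score.

    Each row needs at least two entries. A small margin means the best
    match is barely better than the runner-up, so the assignment is
    unreliable.
    """
    margins = []
    for row in similarity_matrix:
        best, second = sorted(row, reverse=True)[:2]
        margins.append(best - second)
    return margins


# Made-up similarities (assumption): rows are clusters, columns are categories.
similarity_matrix = [
    [0.91, 0.90, 0.89],
    [0.93, 0.88, 0.90],
]

margins = top2_margins(similarity_matrix)
for row_id, margin in enumerate(margins):
    print(row_id, round(margin, 3))
```

If most margins are tiny compared to the scores themselves, the ranking carries very little signal, which matches the symptom described above.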
I assumed that BERT embeddings would be a good fit for representing these image labels.
On the other hand, I also tried the BERTopic approach, and it works reasonably well.

What can I do to improve my approach? Would fine-tuning help? Since I don’t have labeled data for fine-tuning a classifier, do I need to label it manually? How much data would be enough? I currently have about 40k images in my dataset, and labeling them by hand would be pretty painful…

Any help or advice would be appreciated.

My code is on Kaggle if you’d like to take a look: