Hi everyone
I’m new to the world of computer vision and would really appreciate some crowd wisdom.
Is there a way, using today’s tools and libraries, to categorize a folder full of images of places and buildings? For example, if I have a folder with 2 images of the Eiffel Tower, 3 images of Pisa, and 4 images of the Colosseum (for simplicity, let’s assume the images are taken from the same or very similar angles), can I write code that will eventually sort these into 3 folders, each containing similar images? To clarify, I’m not talking about a model that recognizes specific landmarks like the Eiffel Tower, but rather one that organizes the images into folders based on their similarity to each other.
Thanks to everyone who helps!
Hello!
Yes, you can! In my opinion, the easiest/fastest way is to use a generalist pretrained image model (an ImageNet-trained CNN, a Vision Transformer…) and apply it to your images one by one. The goal is to end up with one single embedding per image (depending on the model, you extract it differently).
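For example, here is a minimal sketch of that step, assuming torchvision’s pretrained ResNet-50 as the backbone and a placeholder `images/` folder (any pretrained model that exposes a feature vector works the same way):

```python
# Minimal embedding extraction with a pretrained torchvision backbone.
# The folder path and model choice are just examples.
from pathlib import Path

import torch
from PIL import Image
from torchvision import models, transforms

# Load a pretrained ResNet-50 and drop its classification head so the
# output is a 2048-dim feature vector (the "embedding") per image.
weights = models.ResNet50_Weights.DEFAULT
model = models.resnet50(weights=weights)
model.fc = torch.nn.Identity()
model.eval()

preprocess = weights.transforms()  # resize / crop / normalize as the model expects

@torch.no_grad()
def embed(path):
    img = Image.open(path).convert("RGB")
    return model(preprocess(img).unsqueeze(0)).squeeze(0).numpy()

image_paths = sorted(Path("images/").glob("*.jpg"))  # placeholder folder
embeddings = [embed(p) for p in image_paths]
```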
Those vectors can then be fed into a clustering model such as DBSCAN or k-means; this will give you clusters of images that are, according to the model, close in the representation space.
Warning: in the clustering, use the cosine distance rather than the euclidean one.
Based on the clusters, you can create your folders!
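A minimal sketch of the clustering and folder-creation steps, assuming the `image_paths` and `embeddings` lists from the snippet above; the DBSCAN parameters and output directory name are placeholders you would tune for your data:

```python
# Cluster the embeddings with cosine distance and copy each image
# into one folder per cluster.
import shutil
from pathlib import Path

import numpy as np
from sklearn.cluster import DBSCAN

X = np.stack(embeddings)
# metric="cosine" follows the advice above; eps needs tuning for your data.
labels = DBSCAN(eps=0.15, min_samples=2, metric="cosine").fit_predict(X)

for path, label in zip(image_paths, labels):
    # DBSCAN marks outliers with label -1; keep them in a separate folder.
    folder = Path("clusters") / (f"cluster_{label}" if label != -1 else "unclustered")
    folder.mkdir(parents=True, exist_ok=True)
    shutil.copy(path, folder / path.name)
```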
That’s so cool!! Do you think it will be able to cluster images of similar-looking streets/rivers as well?
Because if you gave me a bunch of pictures and asked me to cluster them based on the location or landmark in them, I think I’d do a pretty good job, but a script that does that automatically? Now that’s impressive!
I also have 2 questions, just so I’ll have a better understanding of this concept:
- What’s the point of the first step (the embedding)? Do you know of some good resources to learn from?
- Out of curiosity - why not use the euclidean distance?
Thank you so much!
Yes, you can try it on various kinds of images; I think it will still have decent performance. Similar streets/rivers can be done if there is still something to find in the data. For instance, if you give it two pictures of a New York street and the main element that distinguishes them is a specific piece of text written somewhere, the model will perform poorly, as it has a very generalist representation.
To answer your questions:
- Embeddings are the core resource used and crafted by deep-learning models. You can think of an embedding as a vector representing any input in a high-dimensional space. You can learn about it in the Hugging Face tutorials!
- Transformer models use cosine similarity to compute attention. Hence, the common practice is to use this metric to perform clustering, though you can also try the euclidean distance and see how it goes (see the short comparison below).
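To illustrate the difference with toy vectors (not real embeddings): cosine similarity only compares direction, while euclidean distance is also sensitive to magnitude.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = a * 10  # same direction, very different magnitude

cosine_sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
euclidean = np.linalg.norm(a - b)

print(cosine_sim)  # 1.0   -> "identical" according to cosine similarity
print(euclidean)   # ~33.7 -> "far apart" according to euclidean distance
```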
thank you so much!
You were saying this is the ‘easiest/fastest’ way… what if I want something more ‘powerful’, even if it means learning or trying much more complex concepts? Do you have an idea for such a thing as well?
If you want to go one step further, you can try to fine-tune a model yourself on your dataset, to teach it your specific task on your specific data. However, this comes at a cost and is not guaranteed to succeed.
You can also try an image-to-text model. You predict a caption for every image and cluster the captions instead of the complete images. This has the advantage of simplifying the embedding space: for a given caption like “a man with a hat” you can have 10,000+ pictures but a single caption. It could be called a modality bottleneck: a case where expressing a concept is less complex in one modality than in the others.
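A rough sketch of that caption-then-cluster idea, assuming the BLIP captioning model from Hugging Face and simple TF-IDF vectors for the captions (both are just example choices, as is the `images/` folder):

```python
# Caption every image, then cluster the captions instead of the pixels.
from pathlib import Path

import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

@torch.no_grad()
def caption(path):
    inputs = processor(Image.open(path).convert("RGB"), return_tensors="pt")
    out = captioner.generate(**inputs, max_new_tokens=30)
    return processor.decode(out[0], skip_special_tokens=True)

image_paths = sorted(Path("images/").glob("*.jpg"))  # placeholder folder
captions = [caption(p) for p in image_paths]

# Cluster on the caption text (here with simple TF-IDF vectors);
# eps and min_samples are placeholders to tune.
X = TfidfVectorizer().fit_transform(captions)
labels = DBSCAN(eps=0.5, min_samples=2, metric="cosine").fit_predict(X)
```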