Please read the topic category description to understand what this is all about
One of the most exciting developments in 2021 was the release of OpenAI’s CLIP model, which was trained on 400 million (text, image) pairs collected from the web. One of the cool things you can do with this model is use it for text-to-image and image-to-image search (similar to what is possible when you search for images on your phone).
The goal of this project is to experiment with CLIP and learn about multimodal models. Several ideas can be explored, including:
- Create a text-to-image search engine that allows users to search for images based on natural language queries. Although CLIP was only trained on English text, you can use techniques like Multilingual Knowledge Distillation to extend its embeddings to new languages.
- Create an image-to-image search engine that returns similar images, given a “query” image.
The CLIP models can be found on the Hub.
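If you haven’t used CLIP before, the snippet below is a minimal sketch of the core idea behind both search directions: embed texts and images into the same space, then rank by cosine similarity. The checkpoint, the demo image URL, and the candidate captions are illustrative choices, not part of the project description.

```python
# Minimal sketch: embed texts and an image with CLIP and compare them.
# The checkpoint, image URL, and captions below are illustrative assumptions.
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

checkpoint = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(checkpoint)
processor = CLIPProcessor.from_pretrained(checkpoint)

# Any PIL image works here; this is a commonly used demo photo of two cats.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

texts = ["a photo of two cats", "a photo of a mountain"]
inputs = processor(text=texts, images=[image], return_tensors="pt", padding=True)

with torch.no_grad():
    text_embs = model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )
    image_embs = model.get_image_features(pixel_values=inputs["pixel_values"])

# Normalise so that dot products are cosine similarities, then rank.
text_embs = text_embs / text_embs.norm(dim=-1, keepdim=True)
image_embs = image_embs / image_embs.norm(dim=-1, keepdim=True)
print(image_embs @ text_embs.T)  # higher score = better match
```

For an actual search engine you would pre-compute and store the image embeddings for the whole dataset (e.g. with a nearest-neighbour library such as FAISS) and only embed the query at request time.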
A common dataset that’s used for image demos is the Unsplash Dataset. You can get access to it here.
This project goes beyond the concepts introduced in Part II of the Course, so some familiarity with computer vision would be useful. Having said that, the Transformers API is similar for image tasks, so if you know how the pipeline() function works, then you’ll have no trouble adapting to this new domain.
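For instance, CLIP is exposed through the zero-shot-image-classification pipeline in a recent version of 🤗 Transformers, so the familiar pipeline() workflow carries over directly. The image URL and candidate labels below are placeholders for illustration:

```python
# The pipeline() API works the same way for image tasks as for text tasks.
from transformers import pipeline

classifier = pipeline(
    "zero-shot-image-classification", model="openai/clip-vit-base-patch32"
)

# Image URL and candidate labels are illustrative only.
result = classifier(
    "http://images.cocodataset.org/val2017/000000039769.jpg",
    candidate_labels=["two cats on a couch", "a dog in the snow", "a city skyline"],
)
print(result)  # list of {"label", "score"} dicts, sorted by score
```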
- Create a Streamlit or Gradio app on Spaces that lets users find images that match a natural language query or resemble an input image (see the Gradio sketch after this list).
- Don’t forget to push all your models and datasets to the Hub so others can build on them!
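Here is a rough Gradio sketch of what such a Space could look like for the text-to-image direction. The images/ folder, the checkpoint, and the interface layout are all assumptions made for illustration; a Streamlit version or an image-to-image variant would follow the same pattern.

```python
# Rough sketch of a text-to-image search Space with Gradio (the folder path,
# checkpoint and UI choices below are assumptions, not requirements).
from pathlib import Path

import gradio as gr
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

checkpoint = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(checkpoint)
processor = CLIPProcessor.from_pretrained(checkpoint)

# Index a local folder of images once at startup (hypothetical "images/" folder).
image_paths = sorted(Path("images").glob("*.jpg"))
images = [Image.open(p).convert("RGB") for p in image_paths]
with torch.no_grad():
    image_embs = model.get_image_features(**processor(images=images, return_tensors="pt"))
image_embs = image_embs / image_embs.norm(dim=-1, keepdim=True)

def search(query, top_k):
    # Embed the query text and rank the indexed images by cosine similarity.
    with torch.no_grad():
        text_inputs = processor(text=[query], return_tensors="pt", padding=True)
        text_emb = model.get_text_features(**text_inputs)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    scores = (image_embs @ text_emb.T).squeeze(-1)
    best = scores.topk(min(int(top_k), len(image_paths))).indices.tolist()
    return [str(image_paths[i]) for i in best]

demo = gr.Interface(
    fn=search,
    inputs=[gr.Textbox(label="Query"), gr.Slider(1, 10, value=5, step=1, label="Top k")],
    outputs=gr.Gallery(label="Results"),
    title="CLIP image search (demo sketch)",
)
demo.launch()
```

On a Gradio Space this would go in app.py, with torch, transformers, and Pillow added to requirements.txt, and the images you want to index uploaded alongside it or downloaded at startup.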
To chat and organise with other people interested in this project, head over to our Discord and:
- Follow the instructions on the
- Join the #image-search channel (currently full!)
- Join the
Just make sure you comment here to indicate that you’ll be contributing to this project!