Fine-tune CLIP on satellite images+captions

arampacha · June 29, 2021, 9:41am

Fine-tune CLIP on satellite image data

Description

Fine-tune CLIP on remote sensing image data to enable zero-shot satellite image classification and captioning.
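To make the zero-shot classification idea concrete, here is a minimal sketch in plain Python. It assumes the image and the class prompts have already been embedded by CLIP; the 3-dimensional vectors, the prompt template, the class names and the `logit_scale` value below are made up for illustration and are not taken from the project.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_classify(image_emb, text_embs, labels, logit_scale=100.0):
    """Pick the label whose prompt embedding is most similar to the image embedding.

    logit_scale is an assumed temperature that sharpens the softmax.
    """
    image_emb = normalize(image_emb)
    sims = [sum(a * b for a, b in zip(image_emb, normalize(t))) for t in text_embs]
    probs = softmax([logit_scale * s for s in sims])
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs

# Hypothetical class names and prompt template.
labels = ["airport", "forest", "harbor"]
prompts = [f"an aerial photograph of a {name}" for name in labels]

# Toy stand-ins for the embeddings a text encoder would return for `prompts`
# and an image encoder would return for one satellite image.
text_embs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.1], [0.0, 0.2, 0.9]]
image_emb = [0.2, 0.8, 0.1]

predicted, probs = zero_shot_classify(image_emb, text_embs, labels)
print(predicted, [round(p, 3) for p in probs])
```

No classifier head is trained here: changing the set of classes only means writing new prompts and embedding them.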

Language

The model will be trained in English.

Model

CLIP

Datasets

RSICD + any extra data we can find

RSICD is a dataset for the remote sensing image captioning task. More than ten thousand remote sensing images were collected from Google Earth, Baidu Map, MapABC, and Tianditu. The images are fixed to 224×224 pixels with various resolutions. The total number of remote sensing images is 10,921, with five sentence descriptions per image.
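Since CLIP trains on (image, caption) pairs, a dataset with several captions per image is usually flattened first. The sketch below shows one way to do that and works out how many pairs the numbers above imply. The record layout (`filename`, `captions`) and the example captions are hypothetical and not the actual RSICD file format.

```python
def flatten_captions(records):
    """Turn records holding several captions each into (filename, caption) pairs."""
    pairs = []
    for record in records:
        for caption in record["captions"]:
            pairs.append((record["filename"], caption))
    return pairs

# Hypothetical records, five captions per image as in the description above.
records = [
    {
        "filename": "airport_1.jpg",
        "captions": [
            "many planes are parked next to a terminal",
            "an airport with several runways",
            "planes are lined up near a long building",
            "a large airport surrounded by fields",
            "several aircraft are parked at the gates",
        ],
    },
    {
        "filename": "forest_7.jpg",
        "captions": [
            "a dense green forest",
            "many trees cover the whole area",
            "a road runs through a forest",
            "dark green trees fill the image",
            "a large piece of woodland",
        ],
    },
]

pairs = flatten_captions(records)

# 10921 images with 5 captions each, using the counts quoted above.
expected_pairs = 10921 * 5
print(len(pairs), expected_pairs)
```

Any train/validation split should be done per image rather than per pair, so that captions of the same image do not end up on both sides.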

Possible links to publicly available datasets include:

GitHub - 201528014227051/RSICD_optimal: Datasets for remote sensing images (Paper:Exploring Models and Data for Remote Sensing Image Caption Generation)

Training scripts

The training script for CLIP is on the way (PR).

Desired project outcome

An example notebook/script for CLIP domain adaptation.
A zero-shot satellite image captioning/classification demo app.
A zero-shot text-to-image search app: finding relevant areas given bird-view images and text query.
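The text-to-image search outcome comes down to ranking pre-computed image embeddings by their similarity to the embedding of the query text. Here is a minimal sketch under that assumption; the file names and toy vectors are invented, and a real app would get the embeddings from the fine-tuned model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_emb, image_index, top_k=2):
    """Return the names of the top_k images most similar to the query embedding."""
    scored = [(cosine(query_emb, emb), name) for name, emb in image_index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:top_k]]

# Hypothetical index: image name -> embedding computed once, offline.
image_index = {
    "tile_forest.jpg": [0.1, 0.9, 0.1],
    "tile_harbor.jpg": [0.0, 0.2, 0.9],
    "tile_airport.jpg": [0.9, 0.1, 0.0],
}

# Toy stand-in for the text embedding of a query such as "boats docked in a harbor".
query_emb = [0.1, 0.3, 0.8]

results = search(query_emb, image_index, top_k=2)
print(results)
```

Only the query has to be embedded at search time; for a large collection the linear scan would typically be replaced by an approximate nearest-neighbour index.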

Reads

The following links can be useful to better understand the project and what has previously been done.

Exploring Models and Data for Remote Sensing Image Caption Generation

patrickvonplaten · June 29, 2021, 3:14pm

Great idea and very much feasible IMO (cc @valhalla). Noting it down. Also great that you’ve already found fitting datasets.

goutham794 · June 29, 2021, 4:20pm

Hello. I’m interested to join this project.

devv · June 29, 2021, 11:29pm

Hi, I am interested in this project. Unable to get access to the sheet to add my name though.

valhalla · June 30, 2021, 8:11am

Awesome idea!

Added you to the team, @goutham794 and @devv.

skylord · June 30, 2021, 2:10pm

This is very interesting. Pls add me to the project

ghosh-r · June 30, 2021, 7:25pm

Hi, I am interested in joining the team for this project.

I am working as a Computer Vision Research Consultant solving real-world problems. I have expertise in Computer Vision.
I have experience with Python, PyTorch, and version control systems.
I also have experience working on Kaggle competitions with other team members.
I have good writing and communication skills, and can contribute significantly towards creating the demo and/or any write-up that we might decide to publish.

This project piques my interest as I am interested in new developments in any field involving Vision.

Would love to be a part of this project.

ghosh-r · June 30, 2021, 9:07pm

@valhalla and @patrickvonplaten what do you think about adding me?

Hope you don’t mind me at-mentioning you folks. The deadline is quite near! 🥲

arampacha · June 30, 2021, 10:42pm

Hey everyone! Looks like we have some nice team forming here!

Here are a couple of suggestions for first steps to take:
Communication: a dedicated Discord server looks like a nice option to me. (Let me know if you have a better option in mind.)
Data: we can prepare the dataset and upload it to the hub. I can start on it tomorrow. I guess it would be better to contact the authors of the dataset first?
Model: we can plan a paper discussion meeting to talk about CLIP with a focus on training/fine-tuning hyperparameters. Plus, we can make a list of other relevant papers and distribute those among volunteers to prepare reviews. Here is a good place to start: RSICD Dataset | Papers With Code.
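As background for the paper discussion, CLIP's training objective is a symmetric cross-entropy over the image-text similarity matrix of a batch, with matching pairs on the diagonal. The sketch below is a plain-Python illustration of that loss on toy logits, not the training script from the PR; the matrices are invented for the example.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax of one row of logits."""
    m = max(row)
    log_total = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_total for x in row]

def clip_contrastive_loss(logits):
    """Symmetric cross-entropy for a square similarity matrix.

    logits[i][j] is the scaled similarity of image i and caption j;
    the correct caption for image i is assumed to be at index i.
    """
    n = len(logits)
    # Image -> text: each row should put its probability mass on the diagonal.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text -> image: same computation on the transposed matrix.
    columns = [list(col) for col in zip(*logits)]
    loss_t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2

# Toy batches of 3 pairs: one where matching pairs score highest, one where they do not.
aligned = [[10.0, 0.0, 0.0], [0.0, 10.0, 0.0], [0.0, 0.0, 10.0]]
shuffled = [[0.0, 10.0, 0.0], [0.0, 0.0, 10.0], [10.0, 0.0, 0.0]]

loss_aligned = clip_contrastive_loss(aligned)
loss_shuffled = clip_contrastive_loss(shuffled)
print(loss_aligned, loss_shuffled)
```

Because every other pair in the batch acts as a negative, batch size is one of the hyperparameters worth discussing alongside learning rate.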

If we can make zero-shot image classification work for detecting illegal deforestation, mining, etc., it would be a nice outcome for this project.

Please share your thoughts on this

patrickvonplaten · July 1, 2021, 9:41am

added you @ghosh-r !

ghosh-r · July 1, 2021, 11:59am

Zero-Shot Image Classification:

A week is really a very short time for multiple project goals. But, yes, I am willing to discuss more.

Uploading the data

Surprisingly, the authors of the paper do not mention a License for the dataset. But they do not explicitly forbid the use of the dataset for any purposes.

The dataset README says:

RSICD is used for remote sensing image captioning task. The detailed information about this dataset can be found in our paper “Exploring Models and Data for Remote Sensing Image Caption Generation”.
If you use our dataset, please cite our paper above.

So, they are expecting a citation, which is perfect.

So, I think it would be safe to add their dataset without their explicit permission if we cite them properly.

Communication:

A dedicated Discord sounds perfect. In such projects, it is very important to have screen-sharing and voice calling capabilities. Discord has all that.

Model:

Yes absolutely. We should read the paper first and then host a session to talk about it, and clear doubts. We can look at other examples.

Additional Comments:

We need to have a very good understanding of the structure of the dataset so that we can get started with it. It is very important to create a baseline first and then improve upon it.
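Comparing a baseline with later improvements needs a fixed metric. As one possible choice (an assumption, the thread does not name a metric), the sketch below computes top-k accuracy from ranked class predictions; the predictions and labels are toy values.

```python
def top_k_accuracy(ranked_predictions, true_labels, k=1):
    """Fraction of examples whose true label appears among the first k predictions."""
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, true_labels)
        if truth in ranked[:k]
    )
    return hits / len(true_labels)

# Hypothetical ranked outputs of a baseline model for three images.
ranked_predictions = [
    ["forest", "farmland", "park"],
    ["harbor", "beach", "river"],
    ["park", "forest", "airport"],
]
true_labels = ["forest", "beach", "airport"]

top1 = top_k_accuracy(ranked_predictions, true_labels, k=1)
top3 = top_k_accuracy(ranked_predictions, true_labels, k=3)
print(top1, top3)
```

Running the same function on the pre-trained model and on each fine-tuned checkpoint, with the same held-out images, gives directly comparable numbers.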

arampacha · July 1, 2021, 7:10pm

Here is an invite link to the Discord server: Flax-HuggingFace-Community-Week, channel clip-rsicd. Come join and we can start discussing organizational stuff.

Hirviturkki · September 29, 2021, 9:07pm

Hey!

Any updates on this project, how is it going?

Over and out

Yusin · April 2, 2022, 10:24pm

Sounds interesting. Can I join your team~

arampacha · April 6, 2022, 1:12pm

Hi. Thanks for your interest. I guess I should’ve posted a summary a while ago, so here it is.
This project was part of the Flax/JAX community event and is, in some sense, completed. We obtained some nice results and made it into the top 3 projects for the event. All the details and relevant code can be found in our repo. You can also check out the blog post by our team on the HuggingFace blog and the nice app @sujitpal made to demonstrate some capabilities of the model.
We also identified a couple of potential directions for further research. You can read through those on our GitHub. You are welcome to work on any of those tasks if you’d like.
