Fine-tune CLIP on satellite images+captions

Fine-tune CLIP on remote sensing image data to enable zero-shot satellite image classification and captioning.


The model will be trained in English.
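To make the goal concrete, zero-shot classification with CLIP works by embedding the image and a set of candidate text prompts into a shared space and picking the most similar prompt. Here is a minimal sketch of just that scoring step, with random vectors standing in for real CLIP embeddings (the prompts and the logit-scale value of 100 are illustrative assumptions, not measured from the model):

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, labels):
    """Score one image embedding against text prompt embeddings,
    CLIP-style: L2-normalize both sides, compare by dot product,
    then softmax over the scaled similarities."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = txt @ img                 # cosine similarity per label
    scaled = logits * 100.0            # CLIP learns a logit scale (~100)
    scaled -= scaled.max()             # numerical stability for softmax
    probs = np.exp(scaled)
    probs /= probs.sum()
    return labels[int(np.argmax(logits))], probs

# Toy stand-ins for CLIP embeddings of one image and two candidate prompts.
rng = np.random.default_rng(0)
image_emb = rng.normal(size=512)
text_embs = rng.normal(size=(2, 512))
labels = ["a satellite photo of a forest", "a satellite photo of a harbor"]
best, probs = zero_shot_classify(image_emb, text_embs, labels)
```

In the real project the embeddings would come from the fine-tuned CLIP image and text encoders; only the comparison logic is shown here.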




RSICD + any extra data we can find

RSICD is a dataset for the remote sensing image captioning task. More than ten thousand remote sensing images were collected from Google Earth, Baidu Map, MapABC, and Tianditu. The images are fixed at 224×224 pixels with various resolutions. In total the dataset contains 10,921 remote sensing images, with five sentence descriptions per image.
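For CLIP-style contrastive training, the one-image/five-captions structure described above is usually flattened into (image, caption) pairs. A small sketch, assuming a record layout similar to the RSICD annotation JSON (the field names `filename`, `sentences`, and `raw` are assumptions here; check the actual annotation file for the real schema):

```python
# Hypothetical record mimicking the RSICD annotation layout.
record = {
    "filename": "airport_1.jpg",
    "sentences": [
        {"raw": "many planes are parked next to a long building in an airport ."},
        # ...four more captions per image in the real data
    ],
}

def flatten_pairs(records):
    """Turn one-image/five-captions records into (image, caption) pairs,
    so each caption becomes its own training example."""
    return [(r["filename"], s["raw"]) for r in records for s in r["sentences"]]

pairs = flatten_pairs([record])
```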

Possible links to publicly available datasets include:

Training scripts

The training script for CLIP is on the way (PR).

Desired project outcome

An example notebook/script for CLIP domain adaptation.
A zero-shot satellite image captioning/classification demo app.
A zero-shot text-to-image search app: finding relevant areas given bird's-eye-view images and a text query.
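The text-to-image search outcome boils down to ranking precomputed image embeddings by cosine similarity to a query embedding. A minimal sketch with placeholder vectors (in the real app the vectors would come from the fine-tuned CLIP text and image encoders; the index size of 100 is arbitrary):

```python
import numpy as np

def search(query_emb, image_embs, top_k=3):
    """Return indices and scores of the top_k images most similar
    to the query, by cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb)
    imgs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    scores = imgs @ q
    order = np.argsort(-scores)[:top_k]   # highest similarity first
    return order, scores[order]

rng = np.random.default_rng(1)
image_embs = rng.normal(size=(100, 512))  # pretend index of 100 satellite tiles
query_emb = image_embs[42] + 0.1 * rng.normal(size=512)  # query close to tile 42
top, scores = search(query_emb, image_embs)
```

For a large index one would typically swap the brute-force `argsort` for an approximate nearest-neighbour library, but the ranking logic stays the same.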


The following links can be useful to better understand the project and what has previously been done.


Sounds very interesting! May I join the project?



Great idea and very much feasible IMO (cc @valhalla). Noting it down :slight_smile: Also great that you’ve already found fitting datasets!

Hello. I’m interested in joining this project.


Hi, I am interested in this project. Unable to get access to the sheet to add my name though.


Awesome idea!

Added you to the team @goutham794 and @devv :slight_smile:

Hey, will love to contribute :artificial_satellite:


This is very interesting. Please add me to the project.

Hi, I am interested in joining the team for this project.

  • I work as a Computer Vision Research Consultant solving real-world problems, with expertise in Computer Vision.
  • I have experience with Python, PyTorch, and version control systems.
  • I also have experience working on Kaggle competitions with other team members.
  • I have good writing and communication skills, and can contribute significantly toward creating the demo and/or any write-up we might decide to publish.

This project piques my interest as I am interested in new developments in any field involving Vision.

Would love to be a part of this project.

@valhalla and @patrickvonplaten what do you think about adding me?

Hope you don’t mind me at-mentioning you folks. The deadline is quite near! 🥲


Hey everyone! Looks like we have some nice team forming here!

Here are a couple of suggestions for first steps to take:
  • Communication: a dedicated Discord server looks like a nice option to me. (Let me know if you have a better option in mind.)
  • Data: we can prepare the dataset and upload it to the hub. I can start on it tomorrow. I guess it would be better to contact the authors of the dataset first?
  • Model: we can plan a paper discussion meeting to talk about CLIP with a focus on training/fine-tuning hyperparameters. We can also make a list of other relevant papers and distribute those between volunteers to prepare reviews. Here is a good place to start: RSICD Dataset | Papers With Code.

If we can build zero-shot image classification for detecting illegal deforestation, mining, etc., it would be a nice outcome for this project.

Please share your thoughts on this


added you @ghosh-r !


Zero-Shot Image Classification:

A week is really a very short time for multiple project goals. But, yes, I am willing to discuss more.

Uploading the data

Surprisingly, the authors of the paper do not mention a license for the dataset. But they also do not explicitly forbid its use for any purpose.

In the dataset README, it is mentioned that:

RSICD is used for remote sensing image captioning task. The detailed information about this dataset can be found in our paper “Exploring Models and Data for Remote Sensing Image Caption Generation”.
If you use our dataset, please cite our paper above.

So, they are expecting a citation, which is perfect.

I think it would therefore be safe to add their dataset without explicit permission, as long as we cite them properly.


A dedicated Discord sounds perfect. In such projects, it is very important to have screen-sharing and voice calling capabilities. Discord has all that.


Yes absolutely. We should read the paper first and then host a session to talk about it, and clear doubts. We can look at other examples.

Additional Comments:

We need to have a very good understanding of the structure of the dataset so that we can get started with it. It is very important to create a baseline first and then improve upon it.

Here is an invite link to the Discord server: Flax-HuggingFace-Community-Week channel clip-rsicd. Come join and we can start discussing organizational stuff :hugs:



Any updates on this project, how is it going?

Over and out

Sounds interesting. Can I join your team?

Hi. Thanks for your interest. I guess I should’ve posted a summary a while ago, so here it is.
This project was part of the Flax/JAX community event and is, in some sense, completed. We obtained some nice results and made it into the top 3 projects for the event. All the details and relevant code can be found in our repo. You can also check out the blog post by our team on the HuggingFace blog and the nice app @sujitpal made to demonstrate some capabilities of the model.
We also identified a couple of potential directions for further research. You can read through those on our GitHub. You are welcome to work on any of those tasks if you’d like.