Fine-tune CLIP on remote sensing image data to enable zero-shot satellite image classification and captioning.
The model will be trained on English text.
RSICD, plus any additional remote sensing datasets we can find.
RSICD is a dataset for the remote sensing image captioning task. It contains 10,921 remote sensing images collected from Google Earth, Baidu Map, MapABC, and Tianditu. The images, originally of various resolutions, are fixed to 224x224 pixels, and each image comes with five sentence descriptions.
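Since each RSICD image has five captions, one common way to form (image, caption) training pairs for CLIP-style contrastive fine-tuning is to sample one caption per image per epoch. A minimal sketch of that pairing step (the annotation dict and filenames below are illustrative, not the actual RSICD file layout):

```python
import random

# Hypothetical stand-in for RSICD annotations: each image filename maps
# to its five caption strings. Real RSICD ships these in a JSON file.
annotations = {
    "airport_1.jpg": [
        "many planes are parked next to a long building in an airport",
        "several airplanes are parked near the terminal",
        "an airport with planes and a terminal building",
        "planes lined up beside an airport terminal",
        "a terminal building with aircraft parked nearby",
    ],
}

def sample_pairs(annotations, seed=0):
    """Build (image, caption) pairs, picking one of the five captions
    at random per image -- a simple way to vary captions across epochs."""
    rng = random.Random(seed)
    return [(img, rng.choice(caps)) for img, caps in annotations.items()]

pairs = sample_pairs(annotations)
```

Resampling with a different seed each epoch lets the model eventually see all five descriptions of an image without duplicating images within a batch.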
Possible links to publicly available datasets include:
- GitHub - 201528014227051/RSICD_optimal: Datasets for remote sensing images (Paper:Exploring Models and Data for Remote Sensing Image Caption Generation)
- The training script for CLIP is on the way (PR).
- An example notebook/script for CLIP domain adaptation.
- A zero-shot satellite image captioning/classification demo app.
- A zero-shot text-to-image search app: finding relevant areas given bird's-eye-view images and a text query.
The following links can be useful to better understand the project and what has previously been done.