MedClip - Pretraining CLIP on medical data


Advancements in computer vision and deep learning techniques carry the potential to make significant contributions to healthcare. The current state-of-the-art models for automated diagnosis and outcome prediction using medical imaging tend not to consider additional information such as medical reports.

A multimodal model like CLIP, pre-trained on medical data, could enable new medical applications that combine text and images.
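For context, CLIP learns by contrasting matched image–text pairs against the mismatched pairs in a batch. A minimal NumPy sketch of the symmetric contrastive (InfoNCE) objective — function and variable names here are illustrative, not from any released codebase:

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings."""
    # L2-normalize so dot products are cosine similarities
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature        # (batch, batch) similarity matrix
    labels = np.arange(logits.shape[0])       # i-th image matches i-th text

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # average the image->text and text->image directions
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2
```

A correctly paired batch should score a lower loss than the same embeddings with the text rows shuffled; that gap is what training maximizes.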


Pre-trained ViT and SciBERT models can be found on the model hub.


The MIMIC-CXR dataset can be used for this task. For privacy reasons, access to the dataset is restricted: anyone who wants to participate in this project must obtain the necessary credentials.
In my experience, getting access to MIMIC-CXR is not particularly complicated: you need to accept the terms of the license and take a short course on medical data management. It normally takes ~2 weeks to get the credentials.

Available training scripts

A training script for this will be provided soon. (see PR )


Carrying out a proper evaluation of the model may be difficult.


I would like to join this project! :slightly_smiling_face:

@shpotes That’s an interesting project to work on. I’ve worked with Transformers on the MIMIC-CXR database before, and I would like to experiment with how CLIP fares.

Regarding the database, I believe that one of the team members in a group can get access to the database and work with it.

Let’s connect and form a group, if possible, to carry this project forward.

Hi! I’m very interested in applying NLP to medicine and would love to join this project!

Hi, I am Sweta from India. I am working on deep learning for medical image analysis for my MSc thesis, and am generally interested in applications of AI in medicine/healthcare. With this project, I will be able to work on a new dimension, i.e., NLP in healthcare. Hence, I am very interested in joining this project and working with everyone to hone my NLP skills.
My time zone is IST(GMT + 5:30).

That’s a great idea - let’s officially define this project then :slight_smile:

Putting everybody in the official sheet here. More people can still join! Leave a comment here or in the sheet if you want to change something.

I would like to join this project as well! :slight_smile:

Awesome! Added you to the team :slight_smile:


Hi, may I join as well?

Sure! Just added you to the team :slight_smile:

According to the dataset license, it is not possible to share the dataset with anyone else (I assume that also includes the other participants in the project).


Hey, have been also doing a lot of deep learning in healthcare space, would love to join this!



I am very interested in this topic as well. I work as a Sr. Clinical Data Scientist at Stanford.

Keep me posted on how I can contribute.


It would be important to see if the data can be used and, if so, how! Also, maybe it makes sense to fine-tune the official CLIP weights on the medical data instead of pretraining from scratch? @valhalla


I also think it might make more sense to fine-tune the official CLIP weights! Applying for the data might take a few days, though, so if this project is selected, we should take that into account!

I think this initiative can qualify as “lawful use in scientific research”, so I don’t think there is any problem. In any case, I can communicate directly with the license owners and ask them about it.

Also maybe it might make sense to fine-tune the official CLIP weights on the medical data instead of pretraining from scratch?

Considering the amount of data available, fine-tuning will probably work better. The main reason I proposed training from scratch instead of fine-tuning is the vocabulary: medical text contains many terms that are unusual in more standard domains, and standard tokenizers tend to handle them poorly (see, for instance, Beltagy et al., 2019).
I suppose techniques such as vocabulary recycling (de Vries & Nissim, 2021) or Adapters (Houlsby et al., 2019; Pfeiffer et al., 2020) could solve this problem.
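To illustrate the vocabulary problem, here is a toy greedy longest-match-first tokenizer (the matching scheme WordPiece uses) with two hypothetical vocabularies: the general-domain one shreds a clinical term into many subwords, while a vocabulary that includes the term keeps it as one token.

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first subword tokenization (WordPiece-style).
    Continuation pieces carry the '##' prefix, as in BERT tokenizers."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:          # no piece matches -> unknown token
            return ["[UNK]"]
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical vocabularies, for illustration only
general_vocab = {"card", "##io", "##me", "##ga", "##ly", "lung"}
medical_vocab = general_vocab | {"cardiomegaly"}

print(wordpiece_tokenize("cardiomegaly", general_vocab))
# -> ['card', '##io', '##me', '##ga', '##ly']  (five fragments)
print(wordpiece_tokenize("cardiomegaly", medical_vocab))
# -> ['cardiomegaly']  (a single token)
```

Real general-domain tokenizers behave similarly on clinical terms, which is the fragmentation that Beltagy et al. (2019) motivated SciBERT's in-domain vocabulary with.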


Hi, this is an interesting idea. I would like to join this team if it is possible. I’m working as a Research Engineer (computer vision) at a medical technology startup in Tokyo.


Hi! I would be very interested in joining this project! I am an ML Engineer at Ferrum Health, a healthcare startup in San Francisco, working on both NLP and computer vision. I am familiar with DICOM, and I am currently working with another of the MIMIC datasets (MIMIC-III).


As Patrick said, please see if the data can be made available before the sprint!

And regarding fine-tuning and the medical vocabulary: in this case, we could maybe pair a text encoder that is already trained on medical data with CLIP’s vision model, instead of starting from scratch.
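Structurally, a hybrid setup like that only needs projection heads mapping the two backbones' (possibly different-sized) feature spaces into one shared embedding space, similar in spirit to the dual-encoder models in Transformers. A minimal NumPy sketch, with all names and dimensions hypothetical and random features standing in for the backbones:

```python
import numpy as np

rng = np.random.default_rng(42)

class HybridDualEncoder:
    """Pairs a vision backbone and a text backbone via learned linear
    projections into a shared embedding space. The projections are the
    only new parameters; the backbones could be e.g. a CLIP ViT and
    SciBERT."""

    def __init__(self, vision_dim=768, text_dim=768, shared_dim=512):
        # random init standing in for trained projection weights
        self.vision_proj = rng.normal(size=(vision_dim, shared_dim)) / np.sqrt(vision_dim)
        self.text_proj = rng.normal(size=(text_dim, shared_dim)) / np.sqrt(text_dim)

    def project(self, vision_feats, text_feats):
        v = vision_feats @ self.vision_proj
        t = text_feats @ self.text_proj
        # L2-normalize so dot products are cosine similarities
        v /= np.linalg.norm(v, axis=1, keepdims=True)
        t /= np.linalg.norm(t, axis=1, keepdims=True)
        return v, t

model = HybridDualEncoder()
v, t = model.project(rng.normal(size=(4, 768)), rng.normal(size=(4, 768)))
sims = v @ t.T  # (4, 4) image-report cosine similarity matrix
```

Only the two projection matrices (and optionally the backbones) would be trained with the contrastive objective, which is much cheaper than pretraining everything from scratch.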

Added you to the team @kaushalya and @edugp :slight_smile:
