Advances in computer vision and deep learning have the potential to make significant contributions to healthcare. Current state-of-the-art models for automated diagnosis and outcome prediction from medical imaging tend not to consider additional information such as medical reports.
A multimodal model like CLIP pre-trained on medical data could enable new medical applications that combine text and images.
Model
Pre-trained ViT and SciBERT models can be found on the model hub.
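For reference, both encoders can be loaded from the Hub in a few lines; the exact checkpoints below are plausible choices for illustration, not a decision for the project:

```python
from transformers import AutoModel, AutoTokenizer, ViTModel

# Vision encoder: a pre-trained ViT from the Hub (checkpoint choice is illustrative)
vision_encoder = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

# Text encoder: SciBERT, pre-trained on scientific text
tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
text_encoder = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")
```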
Dataset
The MIMIC-CXR dataset can be used for this task. For privacy reasons, access to the dataset is restricted; anyone who wants to participate in this project must obtain the necessary credentials.
In my experience, getting access to MIMIC-CXR is not particularly complicated: you need to accept the terms of the license and take a short course on medical data management. It normally takes ~2 weeks to obtain the credentials.
Available training scripts
A training script for this will be provided soon (see PR ).
Challenges
Carrying out a proper evaluation of the model may be difficult.
@shpotes That’s an interesting project to work on. I’ve worked with Transformers on the MIMIC-CXR database before, and I would like to experiment with how CLIP fares.
Regarding the database, I believe that one team member in a group can get access to it and work with it.
Let’s connect and form a group, if possible, to carry this project forward.
Hi, I am Sweta from India. I am working on deep learning for medical image analysis for my MSc thesis, and I am generally interested in applications of AI in medicine/healthcare. With this project, I will be able to work on a new dimension, i.e., NLP in healthcare. Hence, I am very interested in joining this project and working with everyone to hone my NLP skills.
My time zone is IST (GMT+5:30).
According to the dataset license, it is not possible to share the dataset with anyone else (I assume that also applies to the other participants in the project).
It would be important to see if the data can be used, and if so, how! Also, maybe it might make sense to fine-tune the official CLIP weights on the medical data instead of pretraining from scratch? @valhalla
I also think that it might make more sense to fine-tune the official CLIP weights! Applying for the data might take a few days, though, so if this project is going to be selected, we might want to take that into account!
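For concreteness, here is a minimal PyTorch sketch of what fine-tuning the official weights could look like; the report texts and blank images below are just placeholders standing in for a batch of MIMIC-CXR report/image pairs:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Start from the public CLIP checkpoint instead of random weights
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Placeholder data: stands in for a batch of MIMIC-CXR report/image pairs
reports = ["No acute cardiopulmonary process.", "Mild cardiomegaly is noted."]
images = [Image.new("RGB", (224, 224)) for _ in reports]

inputs = processor(text=reports, images=images, return_tensors="pt",
                   padding=True, truncation=True)
outputs = model(**inputs, return_loss=True)  # CLIP's image-text contrastive loss
outputs.loss.backward()
optimizer.step()
```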
I think this initiative can qualify as “lawful use in scientific research”, so I don’t think there is any problem. In any case, I can communicate directly with the license owners and ask them about it.
Also, maybe it might make sense to fine-tune the official CLIP weights on the medical data instead of pretraining from scratch?
Considering the amount of data available, fine-tuning will probably work better. The main reason I proposed training from scratch instead of fine-tuning is the vocabulary: medical text is full of terms that are unusual in more standard domains, and standard tokenizers tend to have problems with them (see, for instance, Beltagy et al., 2019).
I suppose that techniques such as recycling (de Vries & Nissim, 2021) or adapters (Houlsby et al., 2019; Pfeiffer et al., 2020) could solve this problem.
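As a quick illustration of the tokenizer issue (the term is just an example I picked, and the exact splits depend on each vocabulary):

```python
from transformers import AutoTokenizer, CLIPTokenizer

# Compare how a general-domain BPE vocabulary and SciBERT's scientific
# vocabulary split a radiology term
clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
scibert_tok = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")

term = "cardiomegaly"
print(clip_tok.tokenize(term))     # general-domain vocab: likely several subword pieces
print(scibert_tok.tokenize(term))  # scientific vocab: typically fewer, cleaner pieces
```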
Hi! I would be very interested in joining this project! I am an ML Engineer at Ferrum Health, a healthcare startup in San Francisco, working on both NLP and computer vision. I am familiar with DICOM, and I am currently working with another of the MIMIC datasets (MIMIC-III).
As Patrick said, please see if the data can be made available before the sprint!
And regarding fine-tuning and medical vocabulary: I think in this case we could use a text encoder trained on medical data and pair it with CLIP’s vision encoder instead of starting from scratch.
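A minimal sketch of that pairing, assuming the installed transformers version ships VisionTextDualEncoderModel (the checkpoints are example choices, not fixed decisions):

```python
from transformers import (AutoImageProcessor, AutoTokenizer,
                          VisionTextDualEncoderModel,
                          VisionTextDualEncoderProcessor)

# Pair CLIP's vision encoder with SciBERT as the text encoder
model = VisionTextDualEncoderModel.from_vision_text_pretrained(
    "openai/clip-vit-base-patch32",      # vision side: official CLIP weights
    "allenai/scibert_scivocab_uncased",  # text side: scientific-domain encoder
)

# Processor combining CLIP's image preprocessing with SciBERT's tokenizer
image_processor = AutoImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
processor = VisionTextDualEncoderProcessor(image_processor, tokenizer)
```

Note that the projection layers of the paired model are newly initialized, so it would still need contrastive fine-tuning on the report/image pairs before it is useful.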