Is there a time series model (like TimeSformer) that extracts features from 4-channel input images?

The standard TimeSformer takes 3-channel (RGB) input images, but my images have 4 channels (RGBD).
I am struggling to find a TimeSformer, or a similar model, that accepts 4-channel input images.
Does anybody know of such a model?
Preferably, I would like to find a pretrained model with weights.