I’ve been scouring the forums and the internet, and there seems to be very little documentation on VisionTextDualEncoder, specifically on how to use the unified embedding space for tasks that would normally require a decoder, such as image segmentation. Can anyone help or point me to the next step? To stay in the HF ecosystem, could we create a child model class that derives from VisionTextDualEncoder and adds a classifier at the end? Thanks!
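
For concreteness, here is a minimal sketch of the "add a head on top" idea, under some assumptions. It wraps a `VisionTextDualEncoderModel` rather than subclassing it (composition is swapped in here only to keep the example short). The shared embedding returned by the projection layers is a single pooled vector per image, so for a dense task like segmentation the sketch reads the per-patch tokens from `vision_model` before pooling. The class name `DualEncoderSegmenter`, the helper `build_tiny_model`, the 1x1 conv head, and the tiny ViT/BERT configs are all hypothetical choices for illustration, and the weights are randomly initialized so nothing is downloaded.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import (
    BertConfig,
    ViTConfig,
    VisionTextDualEncoderConfig,
    VisionTextDualEncoderModel,
)


class DualEncoderSegmenter(nn.Module):
    """Hypothetical wrapper: dual encoder vision tower + a linear per-patch head."""

    def __init__(self, dual_encoder, num_labels):
        super().__init__()
        self.dual_encoder = dual_encoder
        hidden = dual_encoder.vision_model.config.hidden_size
        # A 1x1 conv is a per-patch linear classifier (assumed head design).
        self.head = nn.Conv2d(hidden, num_labels, kernel_size=1)

    def forward(self, pixel_values):
        out = self.dual_encoder.vision_model(pixel_values=pixel_values)
        # Assumes a ViT-style tower: token 0 is [CLS], the rest are patches.
        tokens = out.last_hidden_state[:, 1:, :]
        b, n, d = tokens.shape
        side = int(n ** 0.5)  # assumes a square patch grid
        grid = tokens.transpose(1, 2).reshape(b, d, side, side)
        logits = self.head(grid)
        # Upsample patch-level logits back to the input resolution.
        return F.interpolate(
            logits, size=pixel_values.shape[-2:], mode="bilinear", align_corners=False
        )


def build_tiny_model():
    """Randomly initialized toy dual encoder (sizes are arbitrary assumptions)."""
    vision_cfg = ViTConfig(
        hidden_size=32, num_hidden_layers=1, num_attention_heads=2,
        intermediate_size=64, image_size=32, patch_size=8,
    )
    text_cfg = BertConfig(
        hidden_size=32, num_hidden_layers=1, num_attention_heads=2,
        intermediate_size=64, vocab_size=100,
    )
    cfg = VisionTextDualEncoderConfig.from_vision_text_configs(
        vision_cfg, text_cfg, projection_dim=16
    )
    return VisionTextDualEncoderModel(cfg)


model = DualEncoderSegmenter(build_tiny_model(), num_labels=3)
pixel_values = torch.randn(2, 3, 32, 32)  # dummy batch of two 32x32 RGB images
with torch.no_grad():
    logits = model(pixel_values)
print(logits.shape)  # (batch, num_labels, height, width)
```

In practice you would probably start from pretrained towers, e.g. `VisionTextDualEncoderModel.from_vision_text_pretrained("google/vit-base-patch16-224", "bert-base-uncased")`, and train the head (and optionally the encoder) with a per-pixel cross-entropy loss. Whether patch tokens from a contrastively trained encoder are good enough for segmentation is something you would have to check on your own data.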