I am currently building a system called ClipCap (CLIP Captions) at LAION. It takes a CLIP embedding as input and performs the reverse-DALL-E task of captioning the image. The system builds inputs_embeds from the CLIP embedding instead of starting from token IDs, which makes inference a bit of a pain to handle.
Is it possible to pass inputs_embeds as the input to the generate() method? If not, could someone point me to any resources that would help me reimplement the generation methods from scratch so that they accept inputs_embeds as the input?
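For context, the manual workaround I have in mind looks roughly like the sketch below: call the model with inputs_embeds, take the most likely next token, embed it, append it, and repeat. This is only an illustration under assumptions, not the actual ClipCap code. A tiny, randomly initialised GPT-2 stands in for the real language model, a random tensor stands in for the prefix derived from the CLIP embedding, and `greedy_decode_from_embeds` is a hypothetical helper name. It also re-runs the full sequence at every step instead of caching, and only covers greedy search.

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel


def greedy_decode_from_embeds(model, prefix_embeds, max_new_tokens=10, eos_token_id=None):
    """Greedy decoding that starts from embeddings instead of token ids.

    prefix_embeds: float tensor of shape (batch, prefix_len, hidden_size).
    Returns a long tensor of generated token ids, shape (batch, n_generated).
    """
    model.eval()
    embed_tokens = model.get_input_embeddings()  # token id -> embedding lookup
    embeds = prefix_embeds
    generated = []
    with torch.no_grad():
        for _ in range(max_new_tokens):
            # Forward pass over everything so far (no key/value cache, for simplicity).
            logits = model(inputs_embeds=embeds).logits
            next_token = logits[:, -1, :].argmax(dim=-1)  # shape (batch,)
            generated.append(next_token)
            # Stop early once every sequence in the batch has produced EOS.
            if eos_token_id is not None and bool((next_token == eos_token_id).all()):
                break
            # Embed the chosen token and append it to the running sequence.
            next_embed = embed_tokens(next_token).unsqueeze(1)
            embeds = torch.cat([embeds, next_embed], dim=1)
    return torch.stack(generated, dim=1)


# Assumption: small random model and random prefix, so nothing is downloaded.
torch.manual_seed(0)
config = GPT2Config(vocab_size=100, n_positions=64, n_embd=32, n_layer=2, n_head=2)
model = GPT2LMHeadModel(config)
prefix_embeds = torch.randn(1, 4, config.n_embd)  # stand-in for the CLIP-derived prefix
token_ids = greedy_decode_from_embeds(model, prefix_embeds, max_new_tokens=5)
```

This works for a single greedy pass, but I would have to rebuild beam search, sampling, and the other decoding options myself, which is why I am hoping generate() can take the embeddings directly.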
Many thanks in advance,