Create Custom Loss function for transformers using a diffusion model and CLIP

Hi all,

I am working with Llama 2 and the transformers library. I want to fine-tune Llama 2 with a custom loss function, which seems to be possible by overriding the compute_loss method of the Trainer.

My plan is to compute the loss from the token output of Llama 2 as follows:
(a) decode the tokens to get the text output,
(b) do a bit of text cleaning (e.g., remove emojis),
(c) feed the cleaned text as the image prompt into a diffusion model (e.g., SDXL-Turbo) to generate an image,
(d) compare the generated image (from c) with the text (from b) using CLIP (e.g., “ViT-B/32”), taking the cosine similarity of the text and image embeddings.
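
Step (b) is the easiest part to make concrete. Here is a minimal sketch, assuming that emoji removal plus whitespace normalization is all the cleaning that is needed. The helper name `clean_text` and the chosen Unicode ranges are illustrative only and are not an exhaustive definition of "emoji":

```python
import re

# Assumption: these ranges cover the most common emoji blocks, not all of them.
_EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, extended symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F"                 # variation selector-16 (emoji presentation)
    "\u200D"                 # zero-width joiner used in emoji sequences
    "]+"
)


def clean_text(text: str) -> str:
    """Remove emoji code points and collapse runs of whitespace."""
    without_emoji = _EMOJI_RE.sub(" ", text)
    return " ".join(without_emoji.split())
```

For example, `clean_text("A red fox 🦊 in the snow")` returns `"A red fox in the snow"`. This step works on a decoded Python string, so it carries no gradient of its own.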

The loss would then be something like 1 - clip_score. Fine-tuning Llama 2 would then optimize the text outputs so that they fit the subsequently generated image well, where “well” is measured by the similarity score.
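
To pin down the loss, here is a minimal sketch in plain Python, assuming the CLIP text and image embeddings are already available as equal-length float vectors. The names `cosine_similarity` and `clip_loss` are hypothetical, and a real implementation would operate on tensors rather than lists:

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def clip_loss(text_emb: Sequence[float], image_emb: Sequence[float]) -> float:
    """Loss = 1 - clip_score, where clip_score is the cosine similarity."""
    return 1.0 - cosine_similarity(text_emb, image_emb)
```

With this definition the loss ranges from 0 (embeddings point in the same direction) to 2 (embeddings point in opposite directions).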

I am able to compute (a)-(d), and therefore the loss value, inside the compute_loss function during fine-tuning. However, I am currently not able to compute the corresponding gradients automatically, so the fine-tuning process does not reduce the loss.
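
A likely cause is that step (a) turns continuous logits into discrete token ids, and everything after that works on a string. The toy sketch below illustrates the effect under strong assumptions: a single token, greedy (argmax) decoding, and a hypothetical fixed table `TOKEN_SCORES` standing in for the whole diffusion + CLIP pipeline. A finite-difference check shows that small changes to the logits do not change the loss at all:

```python
from typing import Callable, List

# Assumption: one fixed score per token id, standing in for clip_score.
TOKEN_SCORES = [0.2, 0.7, 0.4]


def decode(logits: List[float]) -> int:
    """Greedy decoding: pick the index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


def loss_from_logits(logits: List[float]) -> float:
    """Loss = 1 - score of the decoded token (piecewise constant in the logits)."""
    return 1.0 - TOKEN_SCORES[decode(logits)]


def finite_difference_grad(
    f: Callable[[List[float]], float], x: List[float], eps: float = 1e-4
) -> List[float]:
    """Central finite-difference estimate of the gradient of f at x."""
    grad = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


logits = [1.0, 2.5, 0.3]
grad = finite_difference_grad(loss_from_logits, logits)  # every entry is 0.0
```

The loss value itself is well defined, but its derivative with respect to the logits is zero wherever the decoded token does not change, so there is nothing useful for backpropagation to pass on.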

So my questions are:
(1) Is this possible at all?
(2) How can I compute or obtain the gradients for steps (b)-(d), so that the loss described above can be optimized? (I would also be interested in partial answers, e.g., how to get the gradients from input to output for a diffusers model only.)
(3) Alternatively, if the answer to (1) is “no”: is there a way to still optimize the same (or a similar) loss, e.g., with a derivative-free method (which, as I understand it, might not be implemented natively) or with a different proxy for the loss function?
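
For question (3), one candidate is a score-function (REINFORCE-style) estimator, which treats clip_score as a reward and only needs gradients of the log-probability of the sampled tokens, not of the reward itself. The following is a toy sketch under heavy assumptions, not a Llama 2 implementation: the "model" is a single softmax over three tokens, the `rewards` list is a hypothetical stand-in for clip_score, and the function name `reinforce` and its hyperparameters are made up for illustration:

```python
import math
import random
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce(rewards: List[float], steps: int = 2000,
              lr: float = 0.5, seed: int = 0) -> List[float]:
    """Maximize the expected reward of a sampled token without differentiating the reward."""
    rng = random.Random(seed)
    logits = [0.0] * len(rewards)
    baseline = 0.0  # running average of the reward, used to reduce variance
    for _ in range(steps):
        probs = softmax(logits)
        token = rng.choices(range(len(probs)), weights=probs)[0]
        reward = rewards[token]  # black box: only its value is used
        advantage = reward - baseline
        baseline += 0.1 * (reward - baseline)
        for i in range(len(logits)):
            # d/d logit_i of log p(token) for a softmax policy
            grad_log_prob = (1.0 if i == token else 0.0) - probs[i]
            logits[i] += lr * advantage * grad_log_prob
    return softmax(logits)


final_probs = reinforce([0.1, 0.9, 0.3])
```

In this toy setting the probability mass should move towards the token with the highest reward (index 1). Whether the same idea scales to full text generations with an expensive diffusion + CLIP reward is exactly what I am unsure about.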

Sorry for the long explanation; I hope it is clear what I want to do.
Thank you so much!