Stable Diffusion ControlNet fine-tune quality issue

Hello,

My project is on semantic video transmission using diffusion models. The idea is that, in scenarios such as drone video transmission, instead of sending the entire video sequence (due to throughput/power/bandwidth restrictions), you send only some of the frames together with the segmentation maps of the frames you do not send, and on the receiver side a diffusion model conditioned on the previously received frames and the segmentation map of the missing frame reconstructs that missing frame.

The model I use is the Stable Diffusion ControlNet pipeline. In my case, instead of a text prompt, Stable Diffusion takes an image prompt to condition on the previous frames, and the ControlNet is used to make the generated image follow the segmentation map. The UNet and ControlNet are fine-tuned with LoRA simultaneously (I have also tried one after the other), and all the other modules (CLIP image encoder, VAE) remain frozen.

For inference I use the StableDiffusionControlNet pipeline: I compute the embeddings of the previous frames outside the pipeline, aggregate them with the same method as in training, e.g. by simply averaging (I have also tried an LSTM for this purpose), and pass the resulting embedding to the pipeline's text prompt embeddings argument (`prompt_embeds`). Of course, I first load the fine-tuned UNet and ControlNet into the pipeline (as well as the VAE I used during training, to be sure). The dataset I am using is UAVid, which consists of drone footage of cities in sequences of 10 frames or fewer (variable length), together with the segmentation map of each frame. Each frame is 512x512.
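For reference, my inference path looks roughly like the sketch below. This is a minimal sketch, not my exact code: the checkpoint paths and file names are placeholders, I show the runwayml SD 1.5 base for brevity (in my actual runs the UNet comes from sd-image-variations), and I pass a zero embedding as the negative prompt so the classifier-free guidance shapes stay consistent.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from transformers import CLIPVisionModelWithProjection, CLIPImageProcessor

device = "cuda"

# Frozen CLIP image encoder (the same one used during training)
image_encoder = CLIPVisionModelWithProjection.from_pretrained(
    "openai/clip-vit-large-patch14"
).to(device)
image_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Fine-tuned ControlNet plus the LoRA weights for the UNet (paths are placeholders)
controlnet = ControlNetModel.from_pretrained("path/to/finetuned_controlnet")
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
).to(device)
pipe.load_lora_weights("path/to/unet_lora")

@torch.no_grad()
def aggregate_frame_embeddings(frames):
    """Encode the previous frames with CLIP and average the projected embeddings."""
    inputs = image_processor(images=frames, return_tensors="pt").to(device)
    embeds = image_encoder(**inputs).image_embeds          # (num_frames, 768)
    return embeds.mean(dim=0, keepdim=True).unsqueeze(1)   # (1, 1, 768)

# Previous frames and the segmentation map of the missing frame (placeholder paths)
previous_frames = [Image.open(f"frames/{i:03d}.png").convert("RGB") for i in range(3)]
seg_map = Image.open("seg_maps/003.png").convert("RGB")

prompt_embeds = aggregate_frame_embeddings(previous_frames)
result = pipe(
    prompt_embeds=prompt_embeds,                             # image embedding instead of a text prompt
    negative_prompt_embeds=torch.zeros_like(prompt_embeds),  # zero embedding for CFG
    image=seg_map,                                           # ControlNet conditioning: the segmentation map
    num_inference_steps=50,
).images[0]
```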

The issue I am facing is that the results seem to have plateaued and their quality does not meet my expectations. What I have concluded is that the ControlNet learns pretty well, as the prediction follows the segmentation map about 95% of the time. From what I have seen, I believe the issue lies either with the CLIP image encoder (maybe it does not produce descriptive enough embeddings) or with the UNet (maybe it has saturated or collapsed). The pretrained baseline I normally use is sd-image-variations v2 (which gives me the best results), which has already been fine-tuned to accept image prompts, but I have also tried runwayml Stable Diffusion 1.5 with no improvement. To check whether the CLIP image encoder is the issue, I have tried concurrently training a shallow MLP mapper as a supporting module to assist the domain transfer, i.e. the aggregated embedding is passed through an MLP before being fed into the UNet, to potentially make it richer/more descriptive (rough sketch below). However, using the MLP results in worse performance. Finally, I have also tried different CLIP image encoders (e.g. StreetCLIP, OpenAI CLIP, etc.), but the issue persists.
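For clarity, the mapper I tried is a shallow MLP along these lines (the hidden size and normalization are illustrative, not my exact configuration); it is trained together with the LoRA weights while the CLIP encoder stays frozen:

```python
import torch.nn as nn

class EmbeddingMapper(nn.Module):
    """Shallow MLP that remaps the aggregated CLIP image embedding before the UNet."""
    def __init__(self, dim=768, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, dim),
            nn.LayerNorm(dim),
        )

    def forward(self, embeds):  # embeds: (batch, seq_len, dim)
        return self.net(embeds)
```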

One final experiment I did is a “cheat” scenario, where I pass the ground-truth image through the CLIP encoder and into the UNet, instead of the previous frames. This tests the limits of what I am doing, because even the most sophisticated aggregation of the previous-frame embeddings will not be more descriptive than the actual ground truth we are trying to predict. The task of the pipeline is then effectively to reconstruct its input at its output. However, the results are about the same as (only slightly better than) the normal experiments described above. Thus, I have come to the conclusion that either the UNet has an issue (collapse or something similar), or this architecture simply cannot achieve better results for what I am trying to do.
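Concretely, the only change relative to the inference sketch above is the conditioning input; everything else (ControlNet seg map, LoRA weights, scheduler) stays the same. The file name is a placeholder:

```python
# "Cheat" run: condition on the ground-truth frame itself instead of the previous frames,
# reusing aggregate_frame_embeddings from the sketch above.
ground_truth = Image.open("frames/003.png").convert("RGB")
prompt_embeds = aggregate_frame_embeddings([ground_truth])
```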

Does anyone have any idea what the issue could be, or what else I could try to improve the quality/detail of my results?

Any insight/idea will be highly appreciated!!

Thank you!!

PS: Sorry for the long post :slight_smile: