I’m training a customized latent diffusion model, which replaces the text embeddings with a custom embedding, and I’m finding that, while the loss drops in a fairly normal way for the first few hundred steps, it then levels off and only improves very slowly. The loss itself is also quite high compared to standard Stable Diffusion training: it levels off at around 0.58 vs 0.1 for Stable Diffusion (in previous runs on my data, that is). In part this is because I’ve added a term to the loss function, but that shouldn’t account for quite so much of the difference.
One thing I’m wondering is whether the standard cross_attention_dim size of 768 is too small for my data? Does that make sense? Because I’ve never done this kind of “surgery” before (ha), I stuck with the original text-embedding shape of (77, 768), just to be as “compliant” as possible, and made my embedding fit that. But maybe it doesn’t offer enough parameters for the model to learn what I’m trying to learn?
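For concreteness, this is roughly what I’m imagining if widening it even makes sense (a quick diffusers sketch, not something I’ve actually run; the 1536 is just a placeholder, and my real embedding module is omitted):

```python
# Rough sketch of widening cross_attention_dim (diffusers).
# The 1536 is a placeholder I haven't tried; the custom embedding module is omitted.
import torch
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel(
    sample_size=64,            # matches the (4, 64, 64) latent
    in_channels=4,
    out_channels=4,
    cross_attention_dim=1536,  # instead of the usual 768
)

# The custom embedding just has to be (batch, seq_len, cross_attention_dim)
custom_emb = torch.randn(1, 77, 1536)

latents = torch.randn(1, 4, 64, 64)
timestep = torch.tensor([10])
noise_pred = unet(latents, timestep, encoder_hidden_states=custom_emb).sample
```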
I should clarify that I haven’t tested a checkpoint yet to see whether it’s working as expected, so I don’t actually know that anything’s wrong. The loss just seems quite high to have stopped improving, compared to what I’m used to from standard Stable Diffusion training.
Any thoughts appreciated.
EDIT: I should clarify that the latent the UNet is generating is the same shape as in the standard model: (4, 64, 64).
Actually, watching an unaltered train_text_to_image.py script running on my data, from scratch, I’m noticing quite a similar pattern. Once it reaches around 0.1 it bounces around a lot without actually improving. Is that typical, or is it perhaps an issue with my data? I’m working with a small “dev” dataset, intended to represent one type of output while being small enough for quick iteration/experimentation. Is it possible that hitting plateaus like this is an indication of a dataset problem rather than a training problem: i.e., the model learns what it can from what I’m giving it, but stops improving because there isn’t enough to generalize from?
Other than that, I’m wondering if there are other kinds of learning-rate scheduling I can use with the train_text_to_image.py script?
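From a quick look, the script appears to build its scheduler with get_scheduler from diffusers.optimization, so I’m guessing the --lr_scheduler flag accepts the standard names. Something like this is what I have in mind (not verified against my exact version; the model, optimizer, and step counts are just dummies):

```python
# What I believe the script does internally (not verified against my version);
# the model/optimizer/step counts are dummies for illustration.
import torch
from diffusers.optimization import get_scheduler

model = torch.nn.Linear(8, 8)  # stand-in for the unet
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

lr_scheduler = get_scheduler(
    "cosine",                  # other options include "linear", "cosine_with_restarts",
                               # "polynomial", "constant", "constant_with_warmup"
    optimizer=optimizer,
    num_warmup_steps=500,
    num_training_steps=10_000,
)
```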
Super interesting, @jbmaxwell!
Unfortunately I’m not sure how much help I can be here, as it sounds relatively specific to your situation.
Re embedding size: off the top of my head, I don’t know of any research that looks at the effects of different text-embedding sizes. If you know of any, I’d be super interested in taking a look.
Re loss graphs: my very basic understanding is that image-based diffusion models don’t have super clean or well-understood loss graphs yet. IIRC the Cold Diffusion paper covers some of this and looks at training diffusion networks from first principles: [2208.09392] Cold Diffusion: Inverting Arbitrary Image Transforms Without Noise
Sorry not to be more help, and please don’t hold it against me if any of my advice isn’t correct here haha!
Hi @williamberman,
Thanks for the reply! After checking out results from training at 20 vs 40 epochs, it seems clear to me that there’s still a lot of improvement going on, even though the loss doesn’t appear to be changing. The bouncing around seems to just be part of the process.
In an intuitive, hand-wavy way it makes sense to me, since the loss function isn’t actually evaluating the image but only the denoising, and there is generally a degree of noise inherent to the images themselves. So we’re really talking about the model learning to “move noise around” correctly, rather than removing it (which could result in an “incorrect” image). But certainly, anecdotally, the results become clearer and more articulate with further training, even though the loss doesn’t change.
And I suppose all those little “peaks” ensure that there’s always something for SGD to work with… Very intriguing process, really… if a bit “magical” for my comfort level…
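Just to spell out what I mean by the loss only evaluating the denoising: as I understand it, the objective is basically an MSE between the noise that was added and the noise the UNet predicts, so the clean image never enters the loss directly. A simplified sketch (not my exact loop, and my extra loss term is left out):

```python
# Simplified sketch of the standard noise-prediction objective
# (roughly what train_text_to_image.py does); everything here is dummy data.
import torch
import torch.nn.functional as F
from diffusers import DDPMScheduler

noise_scheduler = DDPMScheduler(num_train_timesteps=1000)

latents = torch.randn(2, 4, 64, 64)   # stand-in for VAE-encoded images
noise = torch.randn_like(latents)
timesteps = torch.randint(0, 1000, (latents.shape[0],))

# Forward process: mix noise into the clean latents
noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)

# The unet is trained to predict the added noise, e.g.:
# noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states=emb).sample
noise_pred = torch.randn_like(noise)  # stand-in for the unet output

loss = F.mse_loss(noise_pred, noise)  # the image itself is never compared
```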