A couple of super basic questions

Since diffusion is basically a new kind of generative model that, like a GAN or VAE, learns to approximate a data distribution by sampling from a latent variable, I’m guessing we can use it on pretty much any kind of data. That is, while images are 3-channel 2D data, we could also generate arbitrary N-dimensional data using the same technique. Is that correct?

For generation, my understanding is that text conditioning, for example, is basically a matter of creating an embedding vector of text and image data that can be used to influence/tweak the latent during generation. So, the conditional generation process is something like: text_embedding + image_data + noise = new_image. Does this mean that we could also use arbitrary data as input for conditional generation (obviously given that there is some reasonable conditioning information in that data)?

Thanks in advance for any guiding thoughts.

Hi @jbmaxwell! I think you are correct on both counts 🙂

For example, there’s now a 1D UNet that was added to support audio data in Dance Diffusion, and that will be used for other 1D tasks.

Regarding conditioning, Stable Diffusion uses text as the conditioning signal, as you have observed, but other tasks may use different types of data. For example, with “class conditioning” it is possible to train a diffusion model to generate images of specific classes, such as the different ImageNet labels.

Awesome! Thanks so much for the response, @pcuenq.

This makes sense to me. I see that 1D UNet also has a channels argument, so I’m guessing I could create input representations that split an input over some arbitrary number of channels—super cool!
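A minimal sketch of that channel-splitting idea in plain Python, assuming the data arrives as a sequence of frames with a fixed number of features each and that the model wants a channels-first layout. The note-like features and the `to_channels` helper are made up for illustration, not part of any library:

```python
def to_channels(frames):
    """Transpose [length][channels] frames into [channels][length] rows."""
    if not frames:
        return []
    num_channels = len(frames[0])
    if any(len(frame) != num_channels for frame in frames):
        raise ValueError("every frame needs the same number of features")
    return [[frame[c] for frame in frames] for c in range(num_channels)]


# Hypothetical frames: (pitch, velocity, duration) per time step.
frames = [[60, 0.8, 0.25], [62, 0.7, 0.5], [64, 0.9, 0.25], [65, 0.6, 1.0]]
channels = to_channels(frames)  # 3 channels, each of length 4
```

Each feature ends up as its own channel over the same 1D axis, which is the same role R, G and B play for a 2D image.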

It takes a noisy sample and a time step, and I’m guessing the time step is the step index of the diffusion process itself (the Markov chain), not some external time indicator, correct? As I understand it, Stable Diffusion doesn’t have any explicit temporal model at this point (unlike a causal transformer, for example).
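To make the time step point concrete, here is a minimal sketch of DDPM-style forward noising in plain Python. It assumes the commonly used linear schedule (1e-4 to 0.02 over 1000 steps); the function names are invented for illustration and are not a library API:

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, linearly spaced (assumed DDPM-style defaults)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar[t] = product of (1 - beta) over diffusion steps 0..t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(clean, t, alpha_bars, rng):
    """Sample x_t from q(x_t | x_0); t only selects how much noise is mixed in."""
    signal_scale = math.sqrt(alpha_bars[t])
    noise_scale = math.sqrt(1.0 - alpha_bars[t])
    noise = [rng.gauss(0.0, 1.0) for _ in clean]
    noisy = [signal_scale * x + noise_scale * n for x, n in zip(clean, noise)]
    return noisy, noise


alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
rng = random.Random(0)
clean = [math.sin(0.1 * i) for i in range(64)]  # stand-in for any flattened data
slightly_noisy, _ = add_noise(clean, 10, alpha_bars, rng)   # early step: mostly signal
mostly_noise, _ = add_noise(clean, 999, alpha_bars, rng)    # last step: almost pure noise
```

Here `t` indexes the noise level and nothing else; any notion of time within the data would have to live in the data layout or in the conditioning.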

For conditioning, right now I’m most interested in building hierarchical representations. So, using image generation as an analogy, I’m thinking of a system involving two models: one that generates a high-level rough “sketch” of some kind, and another that generates detailed “tiles” from that sketch. Here I’m imagining that the conditioning info could be the high-level sketch (or some encoded representation of it) and an indicator of the position of the tile in the sketch… I wouldn’t be using images, per se, but rather a custom data representation, but that’s the rough idea. Importantly, the tile doesn’t resemble the sketch, but its content is informed by the sketch, the desired coordinate, and a text description that captures the user’s intention wrt the details of the tile.
So, for me, the overall process might be something like:

high_level_image_vector + tile_position + text_embedding + noise = new_tile_image
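As a rough sketch of how those pieces could be packed together, the snippet below concatenates them into one conditioning vector, encoding the tile position with sinusoidal features. This is only one possible choice: all names, sizes and values are hypothetical, and the `+` in the line above is read as "combine", not as literal addition:

```python
import math


def encode_position(row, col, grid_size, num_freqs=4):
    """Sinusoidal features for a tile's (row, col), normalized by grid size."""
    features = []
    for coord in (row / grid_size, col / grid_size):
        for k in range(num_freqs):
            angle = (2.0 ** k) * math.pi * coord
            features.extend([math.sin(angle), math.cos(angle)])
    return features


def build_condition(sketch_vector, tile_row, tile_col, grid_size, text_embedding):
    """Concatenate sketch, position and text features into one vector."""
    position = encode_position(tile_row, tile_col, grid_size)
    return list(sketch_vector) + position + list(text_embedding)


# Placeholder values; real embeddings would come from trained encoders.
sketch_vector = [0.2, -0.5, 0.9, 0.1]
text_embedding = [0.0, 1.0, -1.0]
condition = build_condition(sketch_vector, 2, 3, 8, text_embedding)
```

The denoising model would then receive `condition` alongside the noisy tile and the diffusion time step; how it is injected (concatenation, added embeddings, cross-attention) is a separate design choice.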

Does that seem possible?

Actually, what I was thinking of is pretty much this…


So, already done!
