I'm trying to understand a bit more about time embedding in diffusion models.
I saw you were using the positional encoding from "Attention Is All You Need", which essentially maps any t to a vector pos_t of length dim (the input dimension), where pos_t[2i] = sin(fct of t) and pos_t[2i+1] = cos(fct of t).
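For concreteness, here is a minimal sketch of that encoding as I understand it (my own illustration, not the diffusers implementation; the function name `sinusoidal_embedding` is mine):

```python
import numpy as np

def sinusoidal_embedding(t, dim, max_period=10000):
    """Sketch of the 'Attention Is All You Need' encoding:
    pos_t[2i]   = sin(t / max_period**(2i/dim))
    pos_t[2i+1] = cos(t / max_period**(2i/dim))
    Illustration only -- not the diffusers code."""
    half = dim // 2
    # frequencies max_period**(-i/half) == max_period**(-2i/dim), i = 0..half-1
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = t * freqs                 # shape (half,)
    emb = np.empty(dim)
    emb[0::2] = np.sin(args)         # even indices get sin
    emb[1::2] = np.cos(args)         # odd indices get cos
    return emb

print(sinusoidal_embedding(0.0, 8))  # sin(0)=0 at even slots, cos(0)=1 at odd slots
```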
Diving into your implementation ("diffusers/embeddings.py at v0.11.0 · huggingface/diffusers · GitHub"),
I observe that the mapping is defined by the get_timestep_embedding function, with parameters
flip_sin_to_cos: bool = False,
downscale_freq_shift: float = 1,
scale: float = 1,
max_period: int = 10000,
which is then wrapped as an nn.Module in the class Timesteps(nn.Module).
What I don't clearly understand is the purpose of the class TimestepEmbedding(nn.Module). I thought it simply applied a neural transformation to an input of shape (batch_size, t) and output a tensor of the same shape, which would eventually be fed to the get_timestep_embedding function, but it seems that the forward method does not preserve the size of the input.
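To illustrate what I mean, here is the kind of shape change I'm seeing, sketched as a tiny two-layer MLP with hypothetical sizes (the dimensions, weights, and ReLU activation here are my own stand-ins, not the actual class internals):

```python
import numpy as np

# Hypothetical sizes just to show the question: I expected
# (batch, dim) -> (batch, dim), but an MLP that widens the
# hidden dimension changes the output width.
batch, dim = 4, 32
rng = np.random.default_rng(0)
x = rng.standard_normal((batch, dim))        # sinusoidal embeddings
W1 = rng.standard_normal((dim, 4 * dim))     # first linear layer (made-up size)
W2 = rng.standard_normal((4 * dim, 4 * dim)) # second linear layer (made-up size)
h = np.maximum(x @ W1, 0.0)                  # stand-in activation (ReLU)
out = h @ W2
print(out.shape)                             # (4, 128), not (4, 32)
```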
Could you explain the point of that class?
Also, now that a batch of timesteps has its embeddings of shape (batch_size, embedding_dimension), how are they passed to the UNet jointly with the image?
Thanks a lot!