I'm trying to understand a bit more about time embedding in diffusion models.
I saw you were using the positional encoding from "Attention Is All You Need", which essentially maps any t to a vector pos_t of length dim (the input dimension), where pos_t[2i] = sin(fct of t) and pos_t[2i+1] = cos(fct of t).
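For concreteness, here is a minimal sketch of that encoding as I understand it (my own illustration, not the diffusers implementation; the function name `sinusoidal_embedding` is mine):

```python
import numpy as np

def sinusoidal_embedding(t, dim, max_period=10000):
    """Sketch of the 'Attention Is All You Need' encoding:
    pos_t[2i]   = sin(t / max_period**(2i/dim))
    pos_t[2i+1] = cos(t / max_period**(2i/dim))
    Illustration only -- not the diffusers code."""
    half = dim // 2
    # frequencies max_period**(-i/half) == max_period**(-2i/dim), i = 0..half-1
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = t * freqs                 # shape (half,)
    emb = np.empty(dim)
    emb[0::2] = np.sin(args)         # even indices get sin
    emb[1::2] = np.cos(args)         # odd indices get cos
    return emb

print(sinusoidal_embedding(0.0, 8))  # sin(0)=0 at even slots, cos(0)=1 at odd slots
```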
Diving into your implementation ("diffusers/embeddings.py at v0.11.0 · huggingface/diffusers · GitHub"),
I observe that the mapping is defined by the get_timestep_embedding function, with parameters
flip_sin_to_cos: bool = False,
downscale_freq_shift: float = 1,
scale: float = 1,
max_period: int = 10000,
which is then wrapped as an nn.Module in the class Timesteps(nn.Module).
What I don't clearly understand is the purpose of the class TimestepEmbedding(nn.Module). I thought it simply applied a neural transformation to an input of shape (batch_size, t) and output a tensor of the same shape, which would eventually be fed to the get_timestep_embedding function, but it seems that the forward method does not preserve the size of the input.
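To illustrate what I mean, here is the kind of shape change I'm seeing, sketched as a tiny two-layer MLP with hypothetical sizes (the dimensions, weights, and ReLU activation here are my own stand-ins, not the actual class internals):

```python
import numpy as np

# Hypothetical sizes just to show the question: I expected
# (batch, dim) -> (batch, dim), but an MLP that widens the
# hidden dimension changes the output width.
batch, dim = 4, 32
rng = np.random.default_rng(0)
x = rng.standard_normal((batch, dim))        # sinusoidal embeddings
W1 = rng.standard_normal((dim, 4 * dim))     # first linear layer (made-up size)
W2 = rng.standard_normal((4 * dim, 4 * dim)) # second linear layer (made-up size)
h = np.maximum(x @ W1, 0.0)                  # stand-in activation (ReLU)
out = h @ W2
print(out.shape)                             # (4, 128), not (4, 32)
```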
Could you explain the point of that class?
Also, now that a batch of timesteps has its embeddings of shape (batch_size, embedding_dimension), how are they passed to the UNet jointly with the image?
Thanks a lot!