Question about the attention mask for the text embedding

Hi, I am new to SD. Since the text embedding extracted from CLIP has shape (bs, 77, 768), when I feed this embedding into the UNet to predict noise, do I also need to pass in the sentence's `attention_mask`? Or does the text embedding already carry the padding information from the CLIP text encoder, so there is no need to pass the `attention_mask`? What does the official SD implementation do?
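For context, here is a minimal NumPy sketch of what I mean (the shapes and the token count are just for illustration): in a plain cross-attention, the padded token slots in the CLIP output still receive attention weight unless a mask is applied, so the result changes depending on whether a mask is passed.

```python
import numpy as np

def cross_attention(query, context, mask=None):
    # Scaled dot-product attention: each query attends over all context tokens.
    d = query.shape[-1]
    scores = query @ context.T / np.sqrt(d)      # (q_len, ctx_len)
    if mask is not None:
        # Hide padded context tokens by pushing their scores to -inf-ish.
        scores = np.where(mask, scores, -1e9)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ context

rng = np.random.default_rng(0)
context = rng.normal(size=(77, 768))  # CLIP text embedding, padded to 77 tokens
query = rng.normal(size=(4, 768))     # a few image-latent query vectors
mask = np.arange(77) < 10             # pretend only the first 10 tokens are real

unmasked = cross_attention(query, context)        # padding tokens contribute
masked = cross_attention(query, context, mask)    # padding tokens ignored
```

So my question is essentially whether the official SD UNet runs the unmasked or the masked version of this over the CLIP output.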

Thanks a lot!