While going through the codebase, I found the following code in the BertSelfAttention class:
```python
self.num_attention_heads = config.num_attention_heads
self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
self.all_head_size = self.num_attention_heads * self.attention_head_size
```
I am not sure I understand the reason for recomputing all_head_size from these two values instead of assigning it config.hidden_size directly.
Am I missing something?
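To make the question concrete, here is the arithmetic as I read it, written as a standalone sketch. The concrete numbers (hidden_size=768, num_attention_heads=12, BERT-base's standard values) are my assumption for illustration, not part of the snippet above:

```python
# Standalone sketch of the arithmetic above, using BERT-base's standard
# values (these concrete numbers are an assumption for illustration).
hidden_size = 768
num_attention_heads = 12

attention_head_size = int(hidden_size / num_attention_heads)  # 768 / 12 = 64
all_head_size = num_attention_heads * attention_head_size     # 12 * 64 = 768

# When hidden_size is divisible by num_attention_heads, the recomputed
# value is identical to hidden_size.
assert all_head_size == hidden_size
```

Given the divisibility check quoted below, this assertion should always hold, which is why the recomputation looks redundant to me.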
I also see a check right above this that raises an error when hidden_size is not divisible by num_attention_heads, unless the config defines an embedding_size attribute:
```python
if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, "embedding_size"):
    raise ValueError(
        "The hidden size (%d) is not a multiple of the number of attention "
        "heads (%d)" % (config.hidden_size, config.num_attention_heads)
    )
```
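As far as I can tell, the only situation where all_head_size could differ from config.hidden_size is when this check is skipped, i.e. when the config defines embedding_size. Here is a hypothetical sketch of that case (the concrete numbers are mine, not taken from any real config):

```python
# Hypothetical non-divisible configuration; the check above would only
# let this through if the config defined an embedding_size attribute.
hidden_size = 768
num_attention_heads = 10

attention_head_size = int(hidden_size / num_attention_heads)  # int(76.8) -> 76
all_head_size = num_attention_heads * attention_head_size     # 10 * 76 = 760

# Truncation in int() makes the recomputed size smaller than hidden_size.
assert all_head_size == 760
assert all_head_size != hidden_size
```

Is the recomputation there to handle exactly this truncation case, or is there some other reason?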