While going through the codebase, I found the following code in the BertSelfAttention class:
```python
self.num_attention_heads = config.num_attention_heads
self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
self.all_head_size = self.num_attention_heads * self.attention_head_size
```
I am not sure I understand the reason for recomputing all_head_size from these two values instead of assigning it config.hidden_size directly.
Am I missing something?
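To make the question concrete, here is the arithmetic as I read it, written as a standalone sketch. The concrete numbers (hidden_size=768, num_attention_heads=12, BERT-base's standard values) are my assumption for illustration, not part of the snippet above:

```python
# Standalone sketch of the arithmetic above, using BERT-base's standard
# values (these concrete numbers are an assumption for illustration).
hidden_size = 768
num_attention_heads = 12

attention_head_size = int(hidden_size / num_attention_heads)  # 768 / 12 = 64
all_head_size = num_attention_heads * attention_head_size     # 12 * 64 = 768

# When hidden_size is divisible by num_attention_heads, the recomputed
# value is identical to hidden_size.
assert all_head_size == hidden_size
```

Given the divisibility check quoted below, this assertion should always hold, which is why the recomputation looks redundant to me.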
I also see a check right above this that raises an error when hidden_size is not divisible by num_attention_heads, unless the config defines an embedding_size attribute:
```python
if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, "embedding_size"):
    raise ValueError(
        "The hidden size (%d) is not a multiple of the number of attention "
        "heads (%d)" % (config.hidden_size, config.num_attention_heads)
    )
```
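As far as I can tell, the only situation where all_head_size could differ from config.hidden_size is when this check is skipped, i.e. when the config defines embedding_size. Here is a hypothetical sketch of that case (the concrete numbers are mine, not taken from any real config):

```python
# Hypothetical non-divisible configuration; the check above would only
# let this through if the config defined an embedding_size attribute.
hidden_size = 768
num_attention_heads = 10

attention_head_size = int(hidden_size / num_attention_heads)  # int(76.8) -> 76
all_head_size = num_attention_heads * attention_head_size     # 10 * 76 = 760

# Truncation in int() makes the recomputed size smaller than hidden_size.
assert all_head_size == 760
assert all_head_size != hidden_size
```

Is the recomputation there to handle exactly this truncation case, or is there some other reason?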