I have a not-so-smart question: why does the vocab size increase the number of training parameters by so much?
The following configuration:
from transformers import RobertaConfig

config = RobertaConfig(
    vocab_size=48000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)
Passed to the model, it gives ~80 million parameters:
from transformers import RobertaForMaskedLM

model = RobertaForMaskedLM(config=config)
model.num_parameters()  # → 80.4 million
If vocab_size is reduced to 16k:

model = RobertaForMaskedLM(config=config)
model.num_parameters()  # → 55 million
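For reference, here is a minimal sketch that reproduces both counts side by side (exact totals may vary slightly between transformers versions):

from transformers import RobertaConfig, RobertaForMaskedLM

for vocab in (48000, 16000):
    cfg = RobertaConfig(
        vocab_size=vocab,
        max_position_embeddings=514,
        num_attention_heads=12,
        num_hidden_layers=6,
        type_vocab_size=1,
    )
    # num_parameters() counts every parameter, embedding matrix included
    print(vocab, RobertaForMaskedLM(config=cfg).num_parameters())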
Is it the embedding layers at the start that increase the number of training parameters so much?
self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id)
self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)
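If I inspect the model built with vocab_size=48000 directly (assuming the attribute layout of RobertaForMaskedLM, i.e. model.roberta.embeddings), the word-embedding matrix is exactly vocab_size x hidden_size:

emb = model.roberta.embeddings.word_embeddings
print(emb.weight.shape)    # torch.Size([48000, 768]) for the 48k config
print(emb.weight.numel())  # 36,864,000 parameters from this matrix alone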
The embedding layer takes:

nn.Embedding(vocab_size, embedding_dim)

where in my case embedding_dim is 768 (and 768 * 48000 - 768 * 16000 ≈ 24.6 million, roughly the ~25 million difference above).
So would the number of training parameters contributed by this layer be vocab_size * embedding_dim?
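A quick back-of-the-envelope check of that (my own arithmetic, using the default hidden_size of 768):

def word_embedding_params(vocab_size, hidden_size=768):
    # one hidden_size-dimensional row per token in the vocabulary
    return vocab_size * hidden_size

# difference between the 48k and 16k configurations
print(word_embedding_params(48000) - word_embedding_params(16000))  # 24,576,000 ≈ 25 million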