[Question] Why does vocab size affect the number of training parameters?

I have a possibly naive question: why does increasing the vocab size increase the number of training parameters so much?
The following configuration:

from transformers import RobertaConfig
config = RobertaConfig(
    vocab_size=48000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)

Passed to the model, this gives roughly 80 million parameters:

from transformers import RobertaForMaskedLM

model = RobertaForMaskedLM(config=config)
model.num_parameters() → 80.4 million

If vocab_size is reduced to 16k,

model = RobertaForMaskedLM(config=config)
model.num_parameters() → 55 million
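
For reference, both runs can be reproduced with a short sketch like the one below. The count_params helper is just for illustration (it is not part of transformers); everything else is the standard RobertaConfig / RobertaForMaskedLM API.

from transformers import RobertaConfig, RobertaForMaskedLM

def count_params(vocab_size: int) -> int:
    # Illustrative helper: build the same 6-layer RoBERTa MLM as above
    # with a given vocab size and return its total parameter count.
    config = RobertaConfig(
        vocab_size=vocab_size,
        max_position_embeddings=514,
        num_attention_heads=12,
        num_hidden_layers=6,
        type_vocab_size=1,
    )
    return RobertaForMaskedLM(config=config).num_parameters()

big, small = count_params(48_000), count_params(16_000)
print(big - small)
# ≈ 24.6M: 32_000 extra rows * 768 hidden size in the word-embedding matrix,
# plus a 32_000-entry bias in the (weight-tied) LM head.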

Does the embedding layer at the start really increase the number of training parameters by that much?

        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)
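
Plugging the config values into those three tables (hidden_size defaults to 768 in RobertaConfig) shows where the parameters go; this is just arithmetic on the shapes above:

hidden_size = 768  # RobertaConfig default

word       = 48_000 * hidden_size   # 36_864_000 parameters
position   =    514 * hidden_size   #    394_752 parameters
token_type =      1 * hidden_size   #        768 parameters

# The word-embedding table dwarfs the other two, so vocab_size
# drives almost all of the embedding block's parameter count.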

The embedding layer takes:

nn.Embedding(vocab_size, embedding_space)

where in my case embedding_space is 768 (and 768 * 48,000 - 768 * 16,000 ≈ 25 million, which matches the drop above).
So the number of training parameters contributed by this layer is vocab_size * embedding_space?
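
You can also confirm this directly on the instantiated 48k-vocab model. The attribute path below (model.roberta.embeddings.word_embeddings) is what recent transformers versions use for RobertaForMaskedLM, so adjust it if your version differs:

# Inspect the word-embedding table of the 48k-vocab model built above
word_emb = model.roberta.embeddings.word_embeddings.weight
print(tuple(word_emb.shape))  # (48000, 768)
print(word_emb.numel())       # 36_864_000 = vocab_size * hidden_size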

I’m not sure what your question is; all your math is correct, and yes, the embedding matrix is responsible for a lot of the model parameters.


Thanks for confirming my check!