[Question] Why does vocab size determine training parameters

Andrija · August 4, 2021, 2:20pm

I have a basic question: why does the vocab size increase the number of trainable parameters so much?
The following configuration:

from transformers import RobertaConfig, RobertaForMaskedLM
config = RobertaConfig(
    vocab_size=48000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)

Passed to the model, this gives about 80 million parameters:

model = RobertaForMaskedLM(config=config)
model.num_parameters()  # → 80.4 million

If vocab_size is reduced to 16k (everything else unchanged),

model = RobertaForMaskedLM(config=config)
model.num_parameters()  # → 55 million

Do the embedding layers at the start really increase the number of trainable parameters that much?

        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)

The embedding layer takes:

Embedding(vocab_size, embedding_space, input_length=max_length)

In my case embedding_space is 768, so 768 * 48000 - 768 * 16000 ≈ 25 million, which matches the drop I see.
So the number of trainable parameters in the word embedding is vocab_size * embedding_space?
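
As a quick check of that arithmetic, here is a minimal sketch in plain Python. It assumes the word embedding table stores one 768-dimensional vector per token; `word_embedding_params` is a hypothetical helper written for this post, not a transformers function.

```python
# Sizes taken from the post; 768 is the hidden size used in the question.
hidden_size = 768
large_vocab = 48_000
small_vocab = 16_000


def word_embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Weights in an embedding table: one hidden_size-dim vector per token."""
    return vocab_size * hidden_size


# Parameters saved in the word embedding when shrinking the vocabulary.
saved = (word_embedding_params(large_vocab, hidden_size)
         - word_embedding_params(small_vocab, hidden_size))
print(f"{saved:,}")  # 24,576,000
```

That is roughly the 25 million gap between the two `num_parameters()` results.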

sgugger · August 4, 2021, 3:04pm

I’m not sure what your question is: all your math is correct and yes, the embedding matrix is responsible for a large share of the model parameters.
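
To put a rough number on that share, the sketch below counts the parameters by hand. It is only an estimate under stated assumptions: the `RobertaConfig` defaults `hidden_size=768` and `intermediate_size=3072`, no pooling layer, and an output projection that shares its weights with the input word embedding. `roberta_mlm_params` is a hypothetical helper, not part of transformers.

```python
def roberta_mlm_params(vocab_size, hidden=768, intermediate=3072,
                       layers=6, max_positions=514, type_vocab=1):
    """Hand count of parameters for a RoBERTa masked-LM (assumptions above)."""
    layer_norm = 2 * hidden                   # scale + bias
    linear = lambda i, o: i * o + o           # weight + bias

    word_emb = vocab_size * hidden
    embeddings = (word_emb + max_positions * hidden
                  + type_vocab * hidden + layer_norm)

    per_layer = (4 * linear(hidden, hidden)   # query, key, value, attention output
                 + layer_norm
                 + linear(hidden, intermediate)
                 + linear(intermediate, hidden)
                 + layer_norm)

    # LM head: dense + layer norm + per-token bias; the projection weight
    # is assumed tied to word_emb, so it is not counted a second time.
    lm_head = linear(hidden, hidden) + layer_norm + vocab_size

    total = embeddings + layers * per_layer + lm_head
    return total, word_emb


for vocab in (48_000, 16_000):
    total, word_emb = roberta_mlm_params(vocab)
    print(f"vocab={vocab}: total={total:,}, word embedding share={word_emb / total:.1%}")
# totals: 80,428,416 for vocab=48000 and 55,820,416 for vocab=16000
```

Under these assumptions the totals line up with the 80.4 million and roughly 55 million figures reported above, and everything outside the word embedding and the per-token bias stays the same size when the vocabulary changes.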

Andrija · August 5, 2021, 11:02am

Thanks for confirming my check!
