Training embeddings of tokens

I added a few tokens to the tokenizer and would now like to train a RoBERTa model. Will training automatically also tune the embedding layer (the layer that embeds the tokens), or is there a flag or anything else I should change so that the embedding layer gets tuned? Schematically, my code looks like this:

from transformers import RobertaForSequenceClassification, Trainer

model = RobertaForSequenceClassification.from_pretrained('roberta-base', num_labels=2)
trainer = Trainer(
    model=model,
    args=training_args,            # training arguments, defined above
    train_dataset=train_dataset,   # the tokenized data
)
trainer.train()

and train_dataset is the tokenized data.
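
The tokens themselves were added roughly like this (the token strings below are just placeholders for the real technical terms):

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
# the real list contains the new domain-specific terms
tokenizer.add_tokens(['new_term_1', 'new_term_2'])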

[I am not an expert, but I believe this is right]

If you need to add different tokens, then you will need to train the RoBERTa model from scratch.
(You probably don’t want to do that.)
It doesn’t work to change the tokens after the model has been pre-trained.

Do you definitely need to add different tokens?
If you just include your different tokens in your data, the tokenizer will probably deal with them OK, by representing them as combinations of tokens it already knows.
I recommend Chris McCormick’s blog posts about this: BERT Word Embeddings Tutorial · Chris McCormick
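
As a quick sanity check (the example word is arbitrary), you can see how an unseen term gets split into subword pieces the tokenizer already knows:

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
# an out-of-vocabulary term comes out as several known subword tokens,
# not as a single unknown token
print(tokenizer.tokenize('immunohistochemistry'))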

By default, if you fine-tune a pre-trained RoBERTa model, the embedding layer will be tuned only very slightly. Most of the change happens in the last few layers, especially the classification head.

If you want to tune ONLY the last layer(s), you can freeze the earlier layers. (It isn’t possible to freeze the later layers and tune the earlier ones.)
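
Here is a minimal sketch of freezing, assuming the same RobertaForSequenceClassification setup as in the question (stopping at layer 10 is just an example):

from transformers import RobertaForSequenceClassification

model = RobertaForSequenceClassification.from_pretrained('roberta-base', num_labels=2)

# freeze the token/position embeddings
for param in model.roberta.embeddings.parameters():
    param.requires_grad = False

# freeze the first 10 of the 12 encoder layers; the last two layers
# and the classification head stay trainable
for layer in model.roberta.encoder.layer[:10]:
    for param in layer.parameters():
        param.requires_grad = False

Note that if you have added new tokens, you would want to leave the embedding layer unfrozen so that their randomly initialised embeddings can be learned.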

@rgwatwormhill I just want to add a couple of hundred tokens representing technical terms in the text. As far as I can see from the answer here: Adding New Vocabulary Tokens to the Models · Issue #1413 · huggingface/transformers · GitHub, adding tokens does not require training from scratch, but it does require fine-tuning the embedding layer (since the embeddings of these tokens are initialized randomly). So I was wondering whether the embedding layer is tuned automatically as part of the training process. From your answer it seems that this is indeed the case?
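
For reference, the extra step described in that issue, on top of adding the tokens, is roughly this (the token strings are placeholders for my real terms):

from transformers import RobertaTokenizer, RobertaForSequenceClassification

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
tokenizer.add_tokens(['my_term_1', 'my_term_2'])  # placeholder terms

model = RobertaForSequenceClassification.from_pretrained('roberta-base', num_labels=2)
# grow the embedding matrix so the newly added token ids get their own
# (randomly initialised) embedding rows
model.resize_token_embeddings(len(tokenizer))

My understanding is that these new rows are ordinary trainable parameters, so they should be updated during trainer.train() unless they are explicitly frozen.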
