Training embeddings of tokens

I added a few tokens to the tokenizer and would now like to train a RoBERTa model. Will training automatically also tune the embedding layer (the layer that embeds the tokens), or is there a flag or anything else I should change so that the embedding layer gets tuned? Schematically, my code looks like this:

from transformers import RobertaForSequenceClassification, Trainer

model = RobertaForSequenceClassification.from_pretrained('roberta-base', num_labels=2)
trainer = Trainer(
    model=model,
    args=training_args,            # training arguments, defined above
    train_dataset=train_dataset,   # the tokenized data
)
trainer.train()

and train_dataset is the tokenized data.
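
The tokens themselves were added roughly like this (the token strings below are just placeholders for the real technical terms):

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
# the real list contains the new domain-specific terms
tokenizer.add_tokens(['new_term_1', 'new_term_2'])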

[I am not an expert, but I believe this is right]

If you need to add different tokens, then you will need to train the RoBERTa model from scratch.
(You probably don’t want to do that.)
It doesn’t work to change the tokens after the model has been pre-trained.

Do you definitely need to add different tokens?
If you just include your different tokens in your data, the tokenizer will probably deal with them OK, by representing them as combinations of tokens it already knows.
I recommend Chris McCormick’s blog posts about this: BERT Word Embeddings Tutorial · Chris McCormick
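
As a quick sanity check (the example word is arbitrary), you can see how an unseen term gets split into subword pieces the tokenizer already knows:

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
# an out-of-vocabulary term comes out as several known subword tokens,
# not as a single unknown token
print(tokenizer.tokenize('immunohistochemistry'))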

By default, if you fine-tune a pre-trained RoBERTa model, the embedding layer will be tuned only very slightly. Most of the change happens in the last few layers, especially the classification head.

If you want to tune ONLY the last layer(s), you can freeze the earlier layers. (It isn’t possible to freeze the later layers and tune the earlier ones.)
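
Here is a minimal sketch of freezing, assuming the same RobertaForSequenceClassification setup as in the question (stopping at layer 10 is just an example):

from transformers import RobertaForSequenceClassification

model = RobertaForSequenceClassification.from_pretrained('roberta-base', num_labels=2)

# freeze the token/position embeddings
for param in model.roberta.embeddings.parameters():
    param.requires_grad = False

# freeze the first 10 of the 12 encoder layers; the last two layers
# and the classification head stay trainable
for layer in model.roberta.encoder.layer[:10]:
    for param in layer.parameters():
        param.requires_grad = False

Note that if you have added new tokens, you would want to leave the embedding layer unfrozen so that their randomly initialised embeddings can be learned.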

@rgwatwormhill I just want to add a couple of hundred tokens representing technical terms in the text. As far as I can see from the answer here: Adding New Vocabulary Tokens to the Models · Issue #1413 · huggingface/transformers · GitHub, adding tokens does not require training from scratch, but it does require fine-tuning the embedding layer (since the embeddings of these tokens are initialized randomly). So I was wondering whether the embedding layer is tuned automatically as part of the training process. From your answer it seems that this is indeed the case?
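
For reference, the extra step described in that issue, on top of adding the tokens, is roughly this (the token strings are placeholders for my real terms):

from transformers import RobertaTokenizer, RobertaForSequenceClassification

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
tokenizer.add_tokens(['my_term_1', 'my_term_2'])  # placeholder terms

model = RobertaForSequenceClassification.from_pretrained('roberta-base', num_labels=2)
# grow the embedding matrix so the newly added token ids get their own
# (randomly initialised) embedding rows
model.resize_token_embeddings(len(tokenizer))

My understanding is that these new rows are ordinary trainable parameters, so they should be updated during trainer.train() unless they are explicitly frozen.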
