Obtaining word embeddings from RoBERTa

Hello Everyone,
I am fine-tuning a pretrained masked LM (distil-roberta) on a custom dataset. After training, I would like to use the word embeddings in a downstream task. How does one go about obtaining embeddings for whole words when the model uses sub-word tokenising? For example, tokenizer.tokenize('floral') will give me ['fl', 'oral']. So, if 'floral' is not even part of the vocabulary, how do I obtain its embedding?

When I do this:

    tokens = tokenizer.encode("floral")
    word = tokenizer.encode("floral", return_tensors='pt')
    output = model(word)

I see that the output has shape torch.Size([1, 4, 50265]), and rightly so, because a masked LM outputs a score for every token in the vocabulary at each input position. I was expecting something like [1, 768]. Can someone please help?
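For readers with the same question: the [1, 4, 50265] tensor comes from the LM head, while the 768-dimensional vectors live in the encoder's hidden states. Below is a minimal sketch of one way to get at them, assuming the Hugging Face transformers and torch packages. The function names `mean_pool` and `embed_word` are made up for illustration, and "distilroberta-base" is only a placeholder for the path to your own fine-tuned checkpoint.

```python
def mean_pool(rows):
    """Average a list of equal-length vectors into a single vector."""
    if not rows:
        raise ValueError("need at least one vector to pool")
    dim = len(rows[0])
    return [sum(row[i] for row in rows) / len(rows) for i in range(dim)]


def embed_word(word, model_name="distilroberta-base"):
    """Return one vector for `word` by averaging its sub-word hidden states.

    Assumption: `model_name` is a placeholder; replace it with the directory
    of your fine-tuned checkpoint.
    """
    # Imported lazily so that mean_pool() can be used without these packages.
    import torch
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForMaskedLM.from_pretrained(model_name)
    model.eval()

    encoded = tokenizer(word, return_tensors="pt", return_special_tokens_mask=True)
    # The mask is not a model input, so remove it before the forward pass.
    special = encoded.pop("special_tokens_mask")[0].tolist()
    with torch.no_grad():
        output = model(**encoded, output_hidden_states=True)

    # output.logits has shape [1, seq_len, vocab_size]; the last entry of
    # output.hidden_states has shape [1, seq_len, hidden_size].
    last_hidden = output.hidden_states[-1][0].tolist()
    sub_word_rows = [row for row, flag in zip(last_hidden, special) if not flag]
    return mean_pool(sub_word_rows)
```

Calling `embed_word("floral")` should return a plain list with one value per hidden dimension. Whether such a single-word vector is useful downstream is a separate question and needs to be tested on your task.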

hey @okkular i believe the standard approach to dealing with this is to simply average the token embeddings of the subwords to generate an embedding for the whole word. having said that, aggregating the embeddings this way might have a negative effect on your downstream performance, so trying both approaches would be a good test :slight_smile:

In addition to the answer by @lewtun, I would not recommend getting context-less embeddings this way. The whole point of context-sensitive models is that the representation depends on the context. The meaning of “floral” (and of basically any word) depends on its context. If you use these LMs in this way, I fear your output representation may not be as stable/good as you’d want.

If you are indeed looking for single-word representations without context, it might be better to rely on context-free representations such as word2vec/GloVe embeddings.

Hey guys,

These are really good responses.

I am also doing something similar. I am training a Language Model using RoBERTa on a fashion items’ description corpus. This way the model should learn embeddings for many common fashion terms like dresses, pants etc. and more specifically, their sub-types like floral dress, abstract dress, animal dress etc. The embeddings obtained in this way should be context-aware since they were trained on such specific data.

Next, I have an image classifier which, given a product image, tells me which category (floral, animal, etc.) it belongs to. I was thinking of using these text embeddings along with the feature vectors of the image classifier to obtain a dense representation for a particular product.

So, my question is along the same lines as okkular's. @lewtun, one way, as you mentioned, is to aggregate the subword embeddings to get one single embedding; however, in your reply to okkular you also mentioned "trying both approaches".

Can you please elaborate on what the other approach is?

Also @BramVanroy does the approach that I mentioned above take into account the context of the fashion language for the downstream task of dense vector representation for a product? I can understand using pretrained BERT/ RoBERTa might not be a good idea but after fine-tuning it this way, it should capture the context alright, correct?

Thanks & Regards,

Yes, fine-tuning would definitely improve things. My point was mainly that using pretrained models as-is for context-free word embeddings is perhaps not the best idea.

sorry for not being clear: all i meant was comparing token-level embeddings vs “word”-level embeddings (obtained via aggregation). as @BramVanroy nicely explained, the former is more desirable in your context :slight_smile:


Can you elaborate on how we can individually use the token-level embeddings to represent a word? Do we stack the embeddings? If so, then different words will have different embedding lengths… How do we unify these embeddings to use them in a downstream task?

Thanks & Regards,

@lewtun @BramVanroy - thank you! Here is what I have gathered from your responses:

  1. We can aggregate sub-word embeddings to obtain word embeddings, but the performance impact needs to be tested on the downstream task.
  2. Context-insensitive embeddings from BERT etc. will perform worse than word2vec, GloVe, etc. I remember hearing this point in Nils Reimers’ video on sentence transformers.

But I am still not clear about two things:

  1. Isn’t it a common task to obtain word embeddings from a fine-tuned LM and then use them for a specific task - like what ElisonSherton has described? Can you please point me to better approaches apart from just aggregating the sub-word embeddings?
  2. Also, some code snippet showing how to extract the embeddings from an LM will be helpful.


First two rules of research: the fact that something is common does not mean that it is SOTA, and what is SOTA in one task is not necessarily SOTA in another. We are still seeing feature-based systems outperform LMs in some cases (esp. low resource).

It might work on your task. It might even work well. If your model is specifically finetuned on single words, then it should be fine. But using a pretrained model as-is (without finetuning) to input a single word and get its embedding… I am hesitant to recommend that, for the reason discussed above: most LMs are context-sensitive, so it makes little sense to get context-free representations out of them. Instead, I would recommend word2vec/GloVe.

In a previous post, I wrote about how you can extract the embedding of a given word in an input sentence by averaging the hidden states of its subwords: "Generate raw word embeddings using transformer models like BERT for downstream process" (Beginners, Hugging Face Forums). HTH
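For reference, here is a minimal sketch of the general idea of averaging subword vectors per word inside a sentence. It is not necessarily the code from the linked post. It assumes the transformers and torch packages and a fast tokenizer (needed for `word_ids()`). The names `pool_by_word` and `embed_words_in_sentence` are made up for illustration, and "distilroberta-base" is a placeholder for your own checkpoint.

```python
from collections import defaultdict


def pool_by_word(token_vectors, word_ids):
    """Average the token vectors that share a word index.

    `word_ids` holds one entry per token; None marks special tokens such as
    <s> and </s>, which are skipped.
    """
    grouped = defaultdict(list)
    for vector, word_id in zip(token_vectors, word_ids):
        if word_id is not None:
            grouped[word_id].append(vector)

    pooled = {}
    for word_id, vectors in grouped.items():
        dim = len(vectors[0])
        pooled[word_id] = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return pooled


def embed_words_in_sentence(sentence, model_name="distilroberta-base"):
    """Return {word_index: vector} for every word the tokenizer finds in `sentence`."""
    # Imported lazily so that pool_by_word() can be used without these packages.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)  # encoder only, no LM head
    model.eval()

    encoded = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**encoded).last_hidden_state[0].tolist()
    # word_ids() maps each token position to the index of the word it came from.
    return pool_by_word(hidden, encoded.word_ids())
```

Note that the word indices follow the tokenizer's own pre-tokenisation, so punctuation may count as a separate "word". Every pooled vector has the same length regardless of how many subwords the word was split into.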

@BramVanroy - thank you!

"If your model is specifically finetuned on single words, then it should be fine"

So, when you say the above, I am assuming something like this:

    # new_vocab contains words specific to my dataset
    with open('/tmp/new-vocab.txt', 'r') as f:
        new_vocab = f.read().split('\n')

    print('Adding new tokens to vocab')
    # ... (rest of this step omitted)

    training_args = TrainingArguments(
        evaluation_strategy="epoch",
        # ... (remaining arguments omitted)
    )
    trainer = Trainer(
        # ... (arguments omitted)
    )

Basically, adding the new words to the vocab and fine-tuning RoBERTa further using MLM only.

Definitely not. What you are doing is not fine-tuning, you are adding more tokens to the existing vocabulary. Those added tokens will be randomly initialized and not contain any meaningful representation. The language model was never taught how those words need to be represented. If on top of this you also tune the model on a word-only task, then it might work.
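To make the point about added tokens concrete, here is a minimal sketch of the vocabulary-extension step, assuming a Hugging Face tokenizer and model. `add_tokens` and `resize_token_embeddings` are the library calls; the helper names `read_new_vocab` and `add_tokens_and_resize` and the one-token-per-line file format are assumptions made for illustration.

```python
def read_new_vocab(path):
    """Read one token per line, dropping blank lines and duplicates (order kept)."""
    with open(path, encoding="utf-8") as handle:
        lines = [line.strip() for line in handle]
    return list(dict.fromkeys(line for line in lines if line))


def add_tokens_and_resize(tokenizer, model, new_tokens):
    """Register `new_tokens` with the tokenizer and grow the embedding matrix.

    Returns the number of tokens that were actually new to the tokenizer.
    """
    num_added = tokenizer.add_tokens(new_tokens)
    # The embedding matrix must have one row per token id. The rows created
    # here are newly initialized, so they carry no learned meaning until the
    # model is trained further.
    model.resize_token_embeddings(len(tokenizer))
    return num_added
```

With real objects this would be called as `add_tokens_and_resize(tokenizer, model, read_new_vocab("/tmp/new-vocab.txt"))` before any further training.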

What I meant was having a fine-tuning task with a large corpus where you specifically have one word as input to RoBERTa and some expected output. I would not know which tasks you could use for this, but theoretically it could work. But you would need to finetune the model, which you do not seem to want to do.

Again: in your case it might be a better idea to stick with context-free representations. BERT and friends are not intended for context-free representations.

Please see the updated code snippet. Sorry for not being explicit previously - what I am planning to do is:

  1. collect domain-specific data
  2. add new words to the vocab
  3. then, finetune RoBERTa for MLM (say, with 20% masking probability)
  4. finally, extract word embeddings for some important words to be used in a downstream task

Will this not yield good word embeddings which are usable down-stream?

Even in that case, you have the same problem:

  • you train on a corpus with context (e.g. sentences). The representation of a word will differ depending on its context in the sentence
  • then you want to get representations by using a single word as input. There is no context. So the representation might be suboptimal because it was never trained to “have meaning” for a word without context
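One way to see the two points above for yourself is to embed the same word once on its own and once inside a sentence, then compare the two vectors. The following is a minimal sketch, assuming the transformers and torch packages and a fast tokenizer (needed for offset mappings). The function names are made up for illustration and "distilroberta-base" is a placeholder for your own checkpoint.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def tokens_covering(offsets, start, end):
    """Indices of tokens whose character span overlaps [start, end)."""
    return [i for i, (s, e) in enumerate(offsets) if s < end and e > start]


def isolated_vs_in_context(word, sentence, model_name="distilroberta-base"):
    """Cosine similarity between `word` embedded alone and inside `sentence`."""
    # Imported lazily so the two helpers above work without these packages.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    def word_vector(text):
        start = text.index(word)  # raises ValueError if the word is absent
        encoded = tokenizer(text, return_tensors="pt", return_offsets_mapping=True)
        # The offsets are not a model input, so remove them before the forward pass.
        offsets = encoded.pop("offset_mapping")[0].tolist()
        with torch.no_grad():
            hidden = model(**encoded).last_hidden_state[0]
        positions = tokens_covering(offsets, start, start + len(word))
        return hidden[positions].mean(dim=0).tolist()

    return cosine_similarity(word_vector(word), word_vector(sentence))
```

For example, `isolated_vs_in_context("floral", "She wore a floral dress.")` returns a single number. A value clearly below 1 would indicate that the isolated and in-context representations differ for that model.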

But as said many times before: the best way to be sure is to just try it and compare with something like word2vec on your downstream task.

Excuse me, how did you specify the new_vocab that should be added?