Training BERT for word embedding

Hello everyone,

I’m not familiar with BERT, but I’d like to train a BERT model just for word embeddings (not NSP or MLM), in order to compare its impact on a task (I can give details if needed) against W2V.

In my case, I’d like to train BERT on my own dataset, but all I can find is how to train BERT for MLM, for example. So I don’t know how to use this model to embed words.

Can someone please help me achieve this goal?

Please feel free to share more details. :slight_smile: It really depends a lot on the context around the sentences/words you want to embed.
Hopefully these two discussions can help: :hugs:

Hello @Nlpeva and thanks for your response.

In my case, I’d like to use BERT and W2V in Word Sense Disambiguation for ambiguous queries (Information Retrieval).
The goal with these models is to embed all context words of the ambiguous word.

  • With W2V, I just built a list of all the sentences in my dataset and used the gensim package to train a W2V model on them.

  • In the case of BERT, I’d like to do the same: pass a list of sentences and, at the end of training, have a model like the original BERT that I can use to embed different words.

By the way, I should note that my supervisor has restricted me to using only the original BERT model and to training a new one on my dataset.

Thanks again for the first link; it’ll help me embed words after training the new BERT on my dataset, but first I have to train the model.

Hmm, have you looked at spacy-transformers? That might be a good fit for your project… Here’s also a paper I read. They tried fine-tuning BERT on the task of predicting the meaning of the ambiguous word.

That might be a bit too in-depth for what your supervisor wanted, though! :hugs:

hi @joval

The HF docs show the parameters and outputs of BertForMaskedLM.

You can train a BERT MLM from scratch with that class.

Thanks to nielsr, there are some good tutorials on fine-tuning BERT with HF.

They will help you understand the whole training structure.
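
To make that concrete, here is a minimal sketch of a single MLM training step with BertForMaskedLM created from a config (random weights, so effectively "from scratch"). The tiny config values, the token ids, and the use of id 4 as the mask token are all made up for illustration; a real run would use your own tokenizer and vocabulary and a full training loop over your sentences.

```python
import torch
from transformers import BertConfig, BertForMaskedLM

torch.manual_seed(0)

# Deliberately tiny, made-up config so the sketch runs quickly on CPU.
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    max_position_embeddings=64,
)
model = BertForMaskedLM(config)  # random initialisation, nothing is downloaded
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Hypothetical ids for one sentence. Position 2 has been replaced by the
# (assumed) mask id 4, and its original id 27 becomes the label.
input_ids = torch.tensor([[2, 15, 4, 8, 41, 3]])
labels = torch.full_like(input_ids, -100)  # -100 marks positions the loss ignores
labels[0, 2] = 27

model.train()
outputs = model(input_ids=input_ids, labels=labels)
outputs.loss.backward()  # cross-entropy computed on the masked position only
optimizer.step()
optimizer.zero_grad()
```

The point is that the same class covers both cases: build it from a config to start from random weights, or load a checkpoint with from_pretrained to continue training an existing model.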


Hi @cog .

Thank you for your response.

But as explained in my first message, I’m not looking for BertForMaskedLM, because I don’t aim to use it. I’d like to use BERT just to embed words, not to predict masked words.

Thank you @Nlpeva for all these resources.

The only problem I have with all this is that, for my WSD in IR, I already have an existing unsupervised process (configured with W2V); the goal is just to see the impact of other word embedding models (especially BERT, as it is supposed to produce better results than W2V).

And to clarify, the process (with W2V) has already been validated by the results it produced. So to use BERT, I can only adapt it to that process, by embedding context words.

What do you think about using Sentence-BERT?

Hi @Nlpeva ,

I think it can be useful for WSD. But, as I described, I don’t want to embed sentences but words. So using it would force me to change my approach, which I can’t do, because the idea here is to compare the impact of BERT and W2V on this approach.

Hi. Sentence BERT is useful for words as well as sentences. You can use it to get word embeddings instead of sentence embeddings.

Hi @pritamdeka ,

Yes, that’s true and thank you.
But what I’d like to know is how to train BERT on a new dataset, not how to use the already pre-trained BERT.
Let me also point out that I have no data annotated for either NSP or MLM.
That’s why I’m asking whether it’s possible and, if so, how to do it.

Would it be possible to know what the annotated data looks like?

For which task?

I think there might be a bit of confusion about what BERT is. BERT was pre-trained using MLM and next sentence prediction (NSP). You can fine-tune using MLM alone for simplicity’s sake. Once you have finished fine-tuning, all you have to do is grab the embeddings from the model before they are passed into the MLM head. You can do this by specifying output_hidden_states=True when calling the model.
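
A minimal sketch of what that looks like, assuming the Hugging Face transformers library with PyTorch. To keep it runnable without downloading anything, it builds a tiny randomly initialised model from a made-up config and feeds it made-up token ids; in practice you would load your own fine-tuned checkpoint with from_pretrained and get the ids from your tokenizer.

```python
import torch
from transformers import BertConfig, BertForMaskedLM

# Tiny, made-up config standing in for your fine-tuned model.
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    max_position_embeddings=64,
)
model = BertForMaskedLM(config)
model.eval()  # disable dropout so the vectors are deterministic

# Hypothetical token ids standing in for one tokenized sentence of 6 tokens.
input_ids = torch.tensor([[2, 15, 27, 8, 41, 3]])
attention_mask = torch.ones_like(input_ids)

with torch.no_grad():
    outputs = model(
        input_ids=input_ids,
        attention_mask=attention_mask,
        output_hidden_states=True,
    )

# Tuple with the embedding-layer output followed by one tensor per layer.
hidden_states = outputs.hidden_states
last_layer = hidden_states[-1]  # shape: (batch, seq_len, hidden_size)
token_vectors = last_layer[0]   # one contextual vector per input token
```

Keep in mind that BERT tokenizes into WordPiece subwords, so one word can map to several token vectors. Averaging the vectors of a word's subwords is one common way to get a single vector per word, but that is a design choice rather than something the model prescribes.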

Read more in the docs!
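
On the earlier point about not having data annotated for MLM: MLM does not need annotations, because the labels are derived from the raw sentences by hiding some tokens and asking the model to recover them. Here is a simplified pure-Python sketch of that idea. The function name and the example sentence are made up for illustration, and the real BERT recipe is a bit more involved (of the selected tokens, 80% become [MASK], 10% are swapped for a random token, and 10% are left unchanged).

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Build (masked_tokens, labels) from raw tokens, with no annotation.

    labels[i] holds the original token where position i was masked and
    None elsewhere (those positions would be ignored by the loss).
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(token)  # the model has to recover this token
        else:
            masked.append(token)
            labels.append(None)   # not a prediction target
    return masked, labels


sentence = "the bank raised its interest rates".split()
masked, labels = mask_tokens(sentence, mask_prob=0.5, seed=1)
print(masked)
print(labels)
```

In the transformers library, DataCollatorForLanguageModeling does this masking for you at batch time, so a plain list of sentences, like the one used for W2V, is enough as training data.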

Hello @sdegrace

Thank you for your reply. Let me explore this.