Please, I’m not familiar with BERT, but I’d like to train a BERT model just for word embeddings (not NSP or MLM), in order to compare its impact on a task (I can give details if needed) against W2V.
In my case, I’d like to train BERT on my own dataset, but all I can find in the literature is how to train BERT for MLM, for example. So I don’t know how to use this model to embed words.
Please feel free to share more details. It really depends a lot on the context around the sentences/words you want to embed.
Hopefully these two discussions can help:
In my case, I’d like to use BERT and W2V in Word Sense Disambiguation for ambiguous queries (Information Retrieval).
The goal with these models is to embed all context words of the ambiguous word.
So with W2V, I’ve just built a list of all the sentences in my dataset and used the gensim package to train a W2V model on these sentences.
In the case of BERT, I’d like to do the same: just pass a list of sentences and, at the end of training, have a model like the original BERT that I can use to embed different words.
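For reference, the W2V side looks roughly like this; a minimal sketch assuming gensim 4.x, where the example sentences and hyperparameters are just placeholders:

```python
# Minimal sketch of the W2V training step, assuming gensim 4.x and that
# `sentences` is a list of tokenized sentences built from the dataset.
from gensim.models import Word2Vec

sentences = [
    ["the", "bank", "approved", "the", "loan"],
    ["she", "sat", "on", "the", "river", "bank"],
]

# Placeholder hyperparameters; the real values depend on the dataset.
w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# One static vector per word, independent of context.
print(w2v.wv["bank"].shape)  # (100,)
```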
By the way, I’d like to note that I’ve been restricted by my supervisor to just using the original BERT model and training a new one on my dataset.
Thanks again for the first link; it’ll help me embed words after training the new BERT on my dataset, but now I first have to train the model.
Hmm, have you looked at spacy-transformers? That might be a good fit for your project… Here’s also a paper I read. They tried fine-tuning BERT on the task of predicting the meaning of the ambiguous word.
That might be a bit too in-depth for what your supervisor wanted, though!
But as explained in my first message, I’m not looking for BertForMaskedLM, because I don’t aim to predict masked words. I’d like to use BERT just to embed words.
The only problem I have with all this is that, for my WSD in IR, I already have an existing unsupervised process (configured with W2V); the goal is just to see the impact of other word-embedding models (especially BERT, as it is supposed to produce better results than W2V).
And to clarify, the process (with W2V) has already been validated with the results it produced. So to use BERT, I can only adapt it to that process, by embedding context words.
I think it can be useful for WSD. But, as I described, I don’t want to embed sentences, only words. So using it would force me to change my approach, which I can’t do, because the idea here is to compare the impact of BERT and W2V on this approach.
Yes, that’s true and thank you.
But what I’d like to know is how to train BERT on a new dataset, not how to use the already pre-trained BERT.
And let me note that my data is annotated neither for NSP nor for MLM.
That’s why I’m asking whether it’s possible and, if so, how to do it.
I think there might be a bit of confusion about what BERT is. BERT was trained using MLM and next sentence prediction. You can fine-tune using MLM alone for simplicity’s sake; MLM is self-supervised, so you don’t need annotated data, because the labels are just the masked tokens themselves. Once you have finished fine-tuning, all you have to do is grab the embeddings from the model before they are passed into the MLM head. You can do this by specifying output_hidden_states=True when calling the model.
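To make that concrete, here is a minimal sketch of continuing MLM training on raw sentences with the Trainer API; the checkpoint name, output path, and hyperparameters are placeholders, and in practice you’d pass your own list of sentences:

```python
# Minimal sketch: continue MLM training of a pretrained BERT on raw sentences.
# No annotations are needed; DataCollatorForLanguageModeling creates the masked
# tokens (and their labels) on the fly.
from datasets import Dataset
from transformers import (
    BertTokenizerFast,
    BertForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

sentences = ["she sat on the river bank", "the bank approved the loan"]  # your corpus

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

dataset = Dataset.from_dict({"text": sentences})
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

args = TrainingArguments(
    output_dir="bert-domain",        # placeholder path
    num_train_epochs=3,
    per_device_train_batch_size=16,
)

trainer = Trainer(model=model, args=args, train_dataset=dataset, data_collator=collator)
trainer.train()

# Save the adapted model so it can be reloaded later for embedding words.
trainer.save_model("bert-domain")
tokenizer.save_pretrained("bert-domain")
```

BertForMaskedLM wraps the same BERT encoder; after training you only keep the encoder outputs and ignore the MLM head.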
Hello @joval, I am very interested in your topic about training BERT for word embeddings. If you don’t mind, I’d like to know if there are any updates on this. Can BERT be used just for word embeddings? My thesis is related to this, and this is my first time using BERT. The research I can find only covers how to train it; I still haven’t found out how to train BERT for word embeddings only. Thanks in advance!
Once you have finished fine-tuning, all you have to do is grab the embeddings from the model before they are passed into the MLM head. You can do this by specifying output_hidden_states=True when calling the model.
Hi @joval, thanks for the reply. I have a follow-up question: after I add the output_hidden_states=True parameter to my model, how can I see the resulting embeddings? Thank you.
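In case it helps, here is a minimal sketch of how the hidden states could be inspected once output_hidden_states=True is set; the checkpoint path ("bert-domain") and the example sentence are only placeholders, and loading the fine-tuned weights into a plain BertModel is one way of dropping the MLM head:

```python
# Minimal sketch: get contextual word embeddings from a (fine-tuned) BERT checkpoint.
import torch
from transformers import BertTokenizerFast, BertModel

tokenizer = BertTokenizerFast.from_pretrained("bert-domain")   # placeholder path
model = BertModel.from_pretrained("bert-domain", output_hidden_states=True)
model.eval()

inputs = tokenizer("she sat on the river bank", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.hidden_states is a tuple: the embedding layer plus one tensor per
# encoder layer, each of shape (batch_size, sequence_length, hidden_size).
last_layer = outputs.hidden_states[-1]            # (1, seq_len, 768) for bert-base
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())

# Vector for one word in this sentence. Note that BERT may split a word into
# several sub-tokens; averaging their vectors gives a single word vector.
word_vector = last_layer[0, tokens.index("bank")]
print(tokens)
print(word_vector.shape)  # torch.Size([768])
```

Unlike W2V, each occurrence of a word gets its own vector depending on the sentence it appears in, so the context words around an ambiguous term can be embedded sentence by sentence.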