Hi all,
I would like to use the embedding model developed by Emily Alsentzer, Bio_ClinicalBERT, but I am not sure how to get the embeddings out of the model's output.
For now I use the model's pooler_output. Is that the correct way?
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

inputs = tokenizer("this is a test", return_tensors="pt")
sentence_vector = model(**inputs).pooler_output  # shape: (batch_size, hidden_size)
The embedding of each token is in last_hidden_state:
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
text = "this is a test"
inputs = tokenizer(text, return_tensors="pt")
model(**inputs)["last_hidden_state"].shape
>>> torch.Size([1, 6, 768])  # (batch_size, sequence_length, hidden_size)
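Concretely, each slice along the sequence dimension is the contextual vector of one token. As a minimal sketch (continuing the snippet above, and assuming the default BERT tokenization where position 0 is [CLS] and the last position is [SEP]):

outputs = model(**inputs)
token_embeddings = outputs.last_hidden_state      # shape (1, 6, 768)
cls_vector = token_embeddings[0, 0]               # vector for the [CLS] token, shape (768,)
first_word_vector = token_embeddings[0, 1]        # vector for the first input token ("this")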
pooler_output corresponds to the [CLS] token, which is used for classification, so it can be seen as a hidden state for the whole sentence. From the documentation:
- pooler_output (torch.FloatTensor of shape (batch_size, hidden_size)) — Last layer hidden-state of the first token of the sequence (classification token) after further processing through the layers used for the auxiliary pretraining task. E.g. for BERT-family of models, this returns the classification token after processing through a linear layer and a tanh activation function. The linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.
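In other words, pooler_output should equal the [CLS] hidden state passed through the model's pooler (a linear layer followed by tanh). A quick sanity check, assuming the checkpoint loads as a standard BertModel that exposes a .pooler submodule (continuing the snippet above):

import torch

outputs = model(**inputs)
cls_hidden = outputs.last_hidden_state[:, 0]                # [CLS] hidden state, (batch_size, hidden_size)
manual_pooled = torch.tanh(model.pooler.dense(cls_hidden))  # linear layer + tanh, as described in the docs
print(torch.allclose(manual_pooled, outputs.pooler_output, atol=1e-6))
# expected: True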
Thank you for your explanation. So, if I consider pooler_output to be a (kind of) embedding of the whole sentence, then that is the correct method.