Where to pick up embedding data from a BERT model?

Hi all,

I would like to use the embedding model developed by Emily Alsentzer, Bio_ClinicalBERT, but I am not sure where to get the embedding data.

For now I use the model’s pooler_output; is that the correct way?

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

inputs = tokenizer(...)
sentence_vector = model(**inputs).pooler_output

The embedding of each token is in last_hidden_state.

from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

text = "this is a test"
inputs = tokenizer(text, return_tensors="pt")
# One 768-dim vector per token: [CLS] this is a test [SEP] -> (batch_size, sequence_length, hidden_size)
model(**inputs)["last_hidden_state"].shape
>>> torch.Size([1, 6, 768])
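
If you want a single vector per sentence built from those token embeddings, a common alternative to pooler_output is to average last_hidden_state over the non-padding tokens (mean pooling). A minimal sketch, reusing tokenizer and model from above (mean_pool is just an illustrative helper name):

import torch

def mean_pool(last_hidden_state, attention_mask):
    # Expand the mask over the hidden dimension so padded tokens contribute nothing
    mask = attention_mask.unsqueeze(-1).float()
    # Sum the token vectors and divide by the number of real (non-padding) tokens
    return (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)

with torch.no_grad():
    outputs = model(**inputs)

sentence_vector = mean_pool(outputs.last_hidden_state, inputs["attention_mask"])
sentence_vector.shape
>>> torch.Size([1, 768])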

pooler_output corresponds to the [CLS] token, which is used for classification, so it can be seen as a hidden state for the whole sentence. From the documentation (see the small check after the quote):

  • pooler_output ( torch.FloatTensor of shape (batch_size, hidden_size) ) — Last layer hidden-state of the first token of the sequence (classification token) after further processing through the layers used for the auxiliary pretraining task. E.g. for BERT-family of models, this returns the classification token after processing through a linear layer and a tanh activation function. The linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.
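
To make that concrete: pooler_output is the last-layer hidden state of the first token ([CLS]) passed through BERT's pooler, i.e. a linear layer followed by tanh. A quick check, reusing model and inputs from the snippet above:

import torch

with torch.no_grad():
    outputs = model(**inputs)
    cls_hidden = outputs.last_hidden_state[:, 0]              # hidden state of the [CLS] token
    recomputed = torch.tanh(model.pooler.dense(cls_hidden))   # BERT pooler: linear layer + tanh

torch.allclose(recomputed, outputs.pooler_output)
>>> True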

Thank you for your explanation.
So, if pooler_output can be considered a (kind of) embedding of the whole sentence, then using it is a correct approach.