Hi all,
I would like to use the embedding model developed by Emily Alsentzer, Bio_ClinicalBERT, but I am not sure how to get the embeddings out of the model's output.
For now I use the model's pooler_output. Is that the correct way?
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

inputs = tokenizer("this is a test", return_tensors="pt")
sentence_vector = model(**inputs).pooler_output  # shape: (batch_size, hidden_size)
The embedding of each token is in last_hidden_state:
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
text = "this is a test"
inputs = tokenizer(text, return_tensors="pt")
model(**inputs)["last_hidden_state"].shape
>>> torch.Size([1, 6, 768])  # (batch_size, sequence_length, hidden_size)
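Concretely, each slice along the sequence dimension is the contextual vector of one token. As a minimal sketch (continuing the snippet above, and assuming the default BERT tokenization where position 0 is [CLS] and the last position is [SEP]):

outputs = model(**inputs)
token_embeddings = outputs.last_hidden_state      # shape (1, 6, 768)
cls_vector = token_embeddings[0, 0]               # vector for the [CLS] token, shape (768,)
first_word_vector = token_embeddings[0, 1]        # vector for the first input token ("this")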
pooler_output corresponds to the [CLS] token, which is used for classification, so it can be seen as a hidden state for the whole sentence. From the documentation:
- pooler_output (torch.FloatTensor of shape (batch_size, hidden_size)) — Last layer hidden-state of the first token of the sequence (classification token) after further processing through the layers used for the auxiliary pretraining task. E.g. for BERT-family of models, this returns the classification token after processing through a linear layer and a tanh activation function. The linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.
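In other words, pooler_output should equal the [CLS] hidden state passed through the model's pooler (a linear layer followed by tanh). A quick sanity check, assuming the checkpoint loads as a standard BertModel that exposes a .pooler submodule (continuing the snippet above):

import torch

outputs = model(**inputs)
cls_hidden = outputs.last_hidden_state[:, 0]                # [CLS] hidden state, (batch_size, hidden_size)
manual_pooled = torch.tanh(model.pooler.dense(cls_hidden))  # linear layer + tanh, as described in the docs
print(torch.allclose(manual_pooled, outputs.pooler_output, atol=1e-6))
# expected: True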
Thank you for your explanation. So, if I consider pooler_output to be a (kind of) embedding of the whole sentence, then that is the correct method.