Hello! I have a problem with getting embeddings from a SmilesClassificationModel. This model inherits from BaseModel in simpletransformers. I want to get the embeddings from the last hidden state, but I can't, because the output has no last_hidden_state attribute.
Still, I managed to get some embeddings the following way:
1. Load model
from transformers import AutoTokenizer

trained_yield_bert = SmilesClassificationModel('bert', model_path,
                                               num_labels=1,
                                               args={"regression": True,
                                                     'config': {"output_hidden_states": True}},
                                               use_cuda=False)
tokenizer1 = AutoTokenizer.from_pretrained(model_path)
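As a sanity check, the output_hidden_states flag can be verified on the underlying model config like this (trained_yield_bert.model is the same wrapped Hugging Face model I call in step 3 below):

# should print True if the flag was applied, plus the model's hidden size and layer count
print(trained_yield_bert.model.config.output_hidden_states)
print(trained_yield_bert.model.config.hidden_size, trained_yield_bert.model.config.num_hidden_layers)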
2. Inputs
test_df.head(1).labels.values holds an ordinary SMILES string (a single row of the dataframe).
bert_inputs = tokenizer1.batch_encode_plus(str(test_df.head(1).labels.values),
                                           max_length=trained_yield_bert.config.max_position_embeddings,
                                           padding=True,
                                           truncation=True,
                                           pad_to_max_length=True,
                                           return_tensors='pt')
bert_inputs
{'input_ids': tensor([[12, 11, 13, ..., 0, 0, 0],
[12, 11, 13, ..., 0, 0, 0],
[12, 24, 13, ..., 0, 0, 0],
...,
[12, 43, 13, ..., 0, 0, 0],
[12, 98, 13, ..., 0, 0, 0],
[12, 11, 13, ..., 0, 0, 0]]), 'token_type_ids': tensor([[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
...,
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, ..., 0, 0, 0],
[1, 1, 1, ..., 0, 0, 0],
[1, 1, 1, ..., 0, 0, 0],
...,
[1, 1, 1, ..., 0, 0, 0],
[1, 1, 1, ..., 0, 0, 0],
[1, 1, 1, ..., 0, 0, 0]])}
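The shapes of these tensors can be checked like this (just a diagnostic sketch; each tensor should be 2D with shape (batch_size, sequence_length), where batch_size is the number of separate texts batch_encode_plus received):

# both should have the same shape: (number of encoded sequences, padded length)
print(bert_inputs['input_ids'].shape)
print(bert_inputs['attention_mask'].shape)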
3. Outputs
import torch

with torch.no_grad():
    output = trained_yield_bert.model(**bert_inputs)
embeddings = output[0].squeeze().cpu().numpy().tolist()
embeddings
[0.672431230545044,
0.672431230545044,
0.8746748566627502,
0.6140751242637634,
0.5577840805053711,
0.522050142288208,
0.6576945781707764,
0.6140751242637634,
0.5635161995887756,
0.5149366855621338,
0.5635161995887756,
0.672431230545044]
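The structure of output can be inspected like this (a diagnostic sketch, assuming the second element is indexable as a collection of per-layer tensors):

# output is a tuple of length 2 here
print(len(output))
# shape of the first element (the one converted to the list above)
print(output[0].shape)
# number of entries in the second element and the shape of one of them
print(len(output[1]), output[1][0].shape)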
4. Questions
The output is a tuple of length 2. The first element is the one shown above; the second one has dimensionality 13 x 12 x 512 x 256.
I want to understand which of these values are the embeddings.
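If I understand the transformers behaviour correctly, with output_hidden_states=True the second element should be a tuple of per-layer hidden states, and its last entry would be the last hidden state. Is the sketch below the right way to get per-token and per-sequence embeddings (assuming that interpretation is correct)?

# assumption: output[1] is the tuple of hidden states returned when output_hidden_states=True,
# and its last entry is the last hidden state with shape (batch, seq_len, hidden_size)
last_hidden_state = output[1][-1]
cls_embeddings = last_hidden_state[:, 0, :]  # one vector per sequence, taken at the [CLS] position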