I am concatenating the last 4 hidden layers of BERT to generate my embeddings, using the method from here. The embedding is a 1D numpy array of length 3,072 (4 × 768).
However, I can’t seem to figure out how to decode these embeddings back into sentences.
I've tried reshaping the embedding and projecting it through the model's output embedding (LM head) matrix:

```python
import torch
import transformers

bert = transformers.BertForMaskedLM.from_pretrained("bert-base-uncased")
tok = transformers.BertTokenizer.from_pretrained("bert-base-uncased")

# Treat each 768-dim chunk as a token vector, project onto the vocabulary,
# and pick the most likely token per row
dec = bert.get_output_embeddings()(
    torch.from_numpy(embedding.reshape(4, 768)).float()
)
print("Decoded sentence:", tok.decode(dec.argmax(-1)))
```
Although this code does return a string of tokens, those tokens are not the original sentence.
How can I decode the embeddings (generated from the last 4 hidden layers of BERT)?