I am trying to understand the difference between the transformer encoder and decoder, after reading the article Transformer-based Encoder-Decoder Models.
Would it be correct to say that adding a causal mask to an encoder-only model makes it the same as a decoder-only model?
According to the article:
auto-regressive models, such as GPT2, have the same architecture as transformer-based decoder models if one removes the cross-attention layer
On a side note, autoencoding models, such as BERT, have the same architecture as transformer-based encoder models.
So, leaving cross-attention aside, the main difference between the transformer encoder and decoder is that the encoder uses bi-directional self-attention while the decoder uses uni-directional (causal) self-attention.
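If that is right, the two attention patterns should differ only in the mask. A minimal sketch of the two patterns (my own illustration, not from the article):

import torch

seq_len = 4
# encoder (bi-directional): every position may attend to every position
bidirectional_mask = torch.ones(seq_len, seq_len)
# decoder (uni-directional / causal): position i may only attend to positions <= i
causal_mask = torch.tril(torch.ones(seq_len, seq_len))
print(causal_mask)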
BERT is an encoder-only model and GPT is a decoder-only model. What if I add a causal mask to BERT to turn it into a decoder?
Referring to the extended attention mask in the BERT implementation, this can be done by setting is_decoder in the config.
After that, I ran an experiment comparing the hidden state of the first token ("Ich") for input_ids and a perturbed input_ids, the same as in the article. It seems that after changing BERT into a decoder, the first token's hidden state still changes when a later input token changes. Why does that happen?
from transformers import AutoModel, AutoTokenizer, AutoConfig
import torch

# BERT, configured as a decoder (causal self-attention via is_decoder)
model_config = AutoConfig.from_pretrained('bert-base-uncased')
model_config.is_decoder = True
bert_model = AutoModel.from_config(model_config)  # note: from_config gives randomly initialised weights
bert_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# GPT-2, a pretrained decoder-only model
gpt_model = AutoModel.from_pretrained('gpt2')
gpt_tokenizer = AutoTokenizer.from_pretrained('gpt2')
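To check that the flag actually switches the mask, the mask that BertModel builds internally can be inspected; get_extended_attention_mask is the helper its forward pass uses. The shapes below are my understanding of the transformers internals, so treat this as a sketch:

attention_mask = torch.ones(1, 4)  # no padding, sequence length 4
# with is_decoder=True this should be a (1, 1, 4, 4) causal mask; an
# encoder-only config would give a broadcastable (1, 1, 1, 4) padding mask
extended = bert_model.get_extended_attention_mask(attention_mask, (1, 4))
print(extended.shape)
print((extended == 0).squeeze())  # True where attention is allowed (lower-triangular)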
Test GPT model
embeddings = gpt_model.get_input_embeddings()  # (unused below)
# create ids of the decoder input vectors
decoder_input_ids = gpt_tokenizer("<pad> Ich will ein", return_tensors="pt", add_special_tokens=False).input_ids
# run the decoder; despite the name, last_hidden_state holds hidden states, not logits
lm_logits = gpt_model(decoder_input_ids).last_hidden_state
# change the last decoder input token
decoder_input_ids_perturbed = gpt_tokenizer("<pad> Ich will das", return_tensors="pt", add_special_tokens=False).input_ids
lm_logits_perturbed = gpt_model(decoder_input_ids_perturbed).last_hidden_state
# compare the hidden state of the first position for the original and perturbed inputs
print("Is encoding for `Ich` equal to its perturbed version?: ", torch.allclose(lm_logits[0, 0], lm_logits_perturbed[0, 0], atol=1e-3))
Is encoding for `Ich` equal to its perturbed version?:  True
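That True is what I would expect from uni-directional self-attention: with a causal mask, position 0 attends only to itself, so changing a later token cannot move its hidden state. As a sanity check (my own addition, not from the article), the last position does see the changed token, so its hidden state should differ:

# the last position attends to every earlier token, including the changed one,
# so here I expect False
print(torch.allclose(lm_logits[0, -1], lm_logits_perturbed[0, -1], atol=1e-3))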
Test BERT model
embeddings = bert_model.get_input_embeddings()  # (unused below)
# create ids of the decoder input vectors
decoder_input_ids = bert_tokenizer("<pad> Ich will ein", return_tensors="pt", add_special_tokens=False).input_ids
# run BERT-as-decoder; last_hidden_state holds hidden states, not logits
lm_logits = bert_model(decoder_input_ids).last_hidden_state
# change the last decoder input token
decoder_input_ids_perturbed = bert_tokenizer("<pad> Ich will das", return_tensors="pt", add_special_tokens=False).input_ids
lm_logits_perturbed = bert_model(decoder_input_ids_perturbed).last_hidden_state
# compare the hidden state of the first position for the original and perturbed inputs
print("Is encoding for `Ich` equal to its perturbed version?: ", torch.allclose(lm_logits[0, 0], lm_logits_perturbed[0, 0], atol=1e-3))
Is encoding for `Ich` equal to its perturbed version?:  False
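A diagnostic I would run next (my own idea, not from the article) is to feed exactly the same input_ids twice and compare the outputs, to separate an effect of the attention mask from anything stochastic in the freshly initialised model:

# two forward passes on identical input should match if the model is deterministic;
# a mismatch here would point at something other than the causal mask
out_a = bert_model(decoder_input_ids).last_hidden_state
out_b = bert_model(decoder_input_ids).last_hidden_state
print(torch.allclose(out_a, out_b, atol=1e-3))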