Difference between transformer encoder and decoder

I am trying to understand the difference between a transformer encoder and a transformer decoder after reading the article Transformer-based Encoder-Decoder Models.

Would it be correct to say that after adding a causal mask to an encoder-only model, it becomes the same as a decoder-only model?

According to the article:

auto-regressive models, such as GPT2, have the same architecture as transformer-based decoder models if one removes the cross-attention layer

On a side-note, autoencoding models, such as Bert, have the same architecture as transformer-based encoder models.

So, without involving cross-attention, the main difference between a transformer encoder and a transformer decoder is that the encoder uses bi-directional self-attention, while the decoder uses uni-directional (causal) self-attention instead.
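To make that concrete, here is a minimal sketch (my own illustration, not from the article) of the two attention patterns: without a mask every position attends to every other position, while a causal mask blocks attention to future positions.

import torch

seq_len = 4
scores = torch.randn(seq_len, seq_len)  # raw attention scores (query x key)

# encoder-style bi-directional self-attention: every token attends to every token
bi_directional_attn = torch.softmax(scores, dim=-1)

# decoder-style uni-directional (causal) self-attention: token i attends only to tokens <= i
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
causal_attn = torch.softmax(scores.masked_fill(~causal_mask, float("-inf")), dim=-1)

print(causal_attn)  # upper triangle is zero: no attention to future positions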

BERT is an encoder-only model and GPT is a decoder-only model. What if I add a causal mask to the BERT model to make it a decoder?

Referring to the extended attention mask in BERT, this can be done by setting is_decoder=True in the config.
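As a quick sanity check, you can inspect the attention mask the model builds once is_decoder=True; it becomes lower-triangular, i.e. causal. This sketch relies on transformers' get_extended_attention_mask helper, whose exact signature and fill values may differ slightly between versions.

import torch
from transformers import AutoConfig, AutoModel

config = AutoConfig.from_pretrained('bert-base-uncased')
config.is_decoder = True
model = AutoModel.from_config(config)

attention_mask = torch.ones(1, 4)  # batch_size=1, seq_len=4, no padding
# note: older transformers versions also expect a device argument here
extended_mask = model.get_extended_attention_mask(attention_mask, attention_mask.shape)

print(extended_mask.shape)  # (1, 1, 4, 4): a seq_len x seq_len causal mask
print(extended_mask[0, 0])  # 0 where attention is allowed, a large negative value where it is blocked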

After that, I ran an experiment comparing the output vector of the first token ("Ich") for input_ids and a perturbed input_ids, the same as in the article. Since the causal mask means the first token attends only to itself, its hidden state should not change when later tokens change. Yet after turning BERT into a decoder, its hidden state still differs between the two inputs. Why does that happen?

from transformers import AutoModel, AutoTokenizer, AutoConfig
import torch

# BERT configured as a decoder (causal self-attention); note that from_config gives randomly initialized weights
model_config = AutoConfig.from_pretrained('bert-base-uncased')
model_config.is_decoder = True
bert_model = AutoModel.from_config(model_config)
bert_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# GPT-2 loaded with pretrained weights
gpt_model = AutoModel.from_pretrained('gpt2')
gpt_tokenizer = AutoTokenizer.from_pretrained('gpt2')

Test GPT model

embeddings = gpt_model.get_input_embeddings()  # (not used below)

# create token ids for the decoder input
decoder_input_ids = gpt_tokenizer("<pad> Ich will ein", return_tensors="pt", add_special_tokens=False).input_ids

# run decoder_input_ids through GPT-2 and take the last hidden state
# (despite its name, lm_logits holds hidden states: AutoModel has no LM head)
lm_logits = gpt_model(decoder_input_ids).last_hidden_state

# change the decoder input slightly
decoder_input_ids_perturbed = gpt_tokenizer("<pad> Ich will das", return_tensors="pt", add_special_tokens=False).input_ids
lm_logits_perturbed = gpt_model(decoder_input_ids_perturbed).last_hidden_state

# compare the output vector of the first token ("Ich") for input_ids and perturbed input_ids
print("Is encoding for `Ich` equal to its perturbed version?: ", torch.allclose(lm_logits[0, 0], lm_logits_perturbed[0, 0], atol=1e-3))

Is encoding for Ich equal to its perturbed version?: True

Test BERT model

embeddings = bert_model.get_input_embeddings()  # (not used below)

# create token ids for the decoder input ("<pad>" is not a BERT special token, so it is split into sub-tokens)
decoder_input_ids = bert_tokenizer("<pad> Ich will ein", return_tensors="pt", add_special_tokens=False).input_ids

# run decoder_input_ids through the BERT decoder and take the last hidden state
lm_logits = bert_model(decoder_input_ids).last_hidden_state

# change the decoder input slightly
decoder_input_ids_perturbed = bert_tokenizer("<pad> Ich will das", return_tensors="pt", add_special_tokens=False).input_ids
lm_logits_perturbed = bert_model(decoder_input_ids_perturbed).last_hidden_state

# compare the output vector of the first token ("Ich") for input_ids and perturbed input_ids
print("Is encoding for `Ich` equal to its perturbed version?: ", torch.allclose(lm_logits[0, 0], lm_logits_perturbed[0, 0], atol=1e-3))

Is encoding for Ich equal to its perturbed version?: False

It is because of dropout. AutoModel.from_config returns a freshly initialized model that is still in training mode, so its dropout layers are active and every forward pass is stochastic. from_pretrained (used for GPT-2 above) puts the model in eval mode by default, which is why the GPT-2 comparison is deterministic.
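A minimal fix is to put the BERT model in eval mode before running the comparison (a sketch, reusing the variables defined above):

# disable dropout so repeated forward passes are deterministic
bert_model.eval()

with torch.no_grad():
    lm_logits = bert_model(decoder_input_ids).last_hidden_state
    lm_logits_perturbed = bert_model(decoder_input_ids_perturbed).last_hidden_state

print("Is encoding for `Ich` equal to its perturbed version?: ",
      torch.allclose(lm_logits[0, 0], lm_logits_perturbed[0, 0], atol=1e-3))

With the causal mask in place, this should now print True: the first token's hidden state depends only on the first token, just as in the GPT-2 case.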
