Difference between transformer encoder and decoder

I am trying to understand the difference between a transformer encoder and a transformer decoder after reading the article Transformer-based Encoder-Decoder Models.

Would it be correct to say that after adding a causal mask to an encoder-only model, it becomes the same as a decoder-only model?

According to the article:

auto-regressive models, such as GPT2, have the same architecture as transformer-based decoder models if one removes the cross-attention layer

On a side-note, autoencoding models, such as Bert, have the same architecture as transformer-based encoder models.

So, without involving cross-attention, the main difference between a transformer encoder and a transformer decoder is that the encoder uses bi-directional self-attention, while the decoder uses uni-directional (causal) self-attention instead.
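To make that concrete, here is a minimal sketch (my own illustration, not from the article) of the two attention patterns: without a mask every position attends to every other position, while a causal mask blocks attention to future positions.

import torch

seq_len = 4
scores = torch.randn(seq_len, seq_len)  # raw attention scores (query x key)

# encoder-style bi-directional self-attention: every token attends to every token
bi_directional_attn = torch.softmax(scores, dim=-1)

# decoder-style uni-directional (causal) self-attention: token i attends only to tokens <= i
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
causal_attn = torch.softmax(scores.masked_fill(~causal_mask, float("-inf")), dim=-1)

print(causal_attn)  # upper triangle is zero: no attention to future positions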

BERT is an encoder-only model and GPT is a decoder-only model. What if I add a causal mask to the BERT model to make it a decoder?

Referring to the extended attention mask in BERT, this can be done by setting is_decoder=True in the config.
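As a quick sanity check, you can inspect the attention mask the model builds once is_decoder=True; it becomes lower-triangular, i.e. causal. This sketch relies on transformers' get_extended_attention_mask helper, whose exact signature and fill values may differ slightly between versions.

import torch
from transformers import AutoConfig, AutoModel

config = AutoConfig.from_pretrained('bert-base-uncased')
config.is_decoder = True
model = AutoModel.from_config(config)

attention_mask = torch.ones(1, 4)  # batch_size=1, seq_len=4, no padding
# note: older transformers versions also expect a device argument here
extended_mask = model.get_extended_attention_mask(attention_mask, attention_mask.shape)

print(extended_mask.shape)  # (1, 1, 4, 4): a seq_len x seq_len causal mask
print(extended_mask[0, 0])  # 0 where attention is allowed, a large negative value where it is blocked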

After that, I ran an experiment comparing the output vector of the first token ("Ich") for input_ids and a perturbed input_ids, the same as in the article. Since the causal mask means the first token attends only to itself, its hidden state should not change when later tokens change. Yet after turning BERT into a decoder, its hidden state still differs between the two inputs. Why does that happen?

from transformers import AutoModel, AutoTokenizer, AutoConfig
import torch

# BERT configured as a decoder (causal self-attention); note that from_config gives randomly initialized weights
model_config = AutoConfig.from_pretrained('bert-base-uncased')
model_config.is_decoder = True
bert_model = AutoModel.from_config(model_config)
bert_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# GPT-2 loaded with pretrained weights
gpt_model = AutoModel.from_pretrained('gpt2')
gpt_tokenizer = AutoTokenizer.from_pretrained('gpt2')

Test GPT model

embeddings = gpt_model.get_input_embeddings()  # (not used below)

# create token ids for the decoder input
decoder_input_ids = gpt_tokenizer("<pad> Ich will ein", return_tensors="pt", add_special_tokens=False).input_ids

# run decoder_input_ids through GPT-2 and take the last hidden state
# (despite its name, lm_logits holds hidden states: AutoModel has no LM head)
lm_logits = gpt_model(decoder_input_ids).last_hidden_state

# change the decoder input slightly
decoder_input_ids_perturbed = gpt_tokenizer("<pad> Ich will das", return_tensors="pt", add_special_tokens=False).input_ids
lm_logits_perturbed = gpt_model(decoder_input_ids_perturbed).last_hidden_state

# compare the output vector of the first token ("Ich") for input_ids and perturbed input_ids
print("Is encoding for `Ich` equal to its perturbed version?: ", torch.allclose(lm_logits[0, 0], lm_logits_perturbed[0, 0], atol=1e-3))

Is encoding for Ich equal to its perturbed version?: True

Test BERT model

embeddings = bert_model.get_input_embeddings()  # (not used below)

# create token ids for the decoder input ("<pad>" is not a BERT special token, so it is split into sub-tokens)
decoder_input_ids = bert_tokenizer("<pad> Ich will ein", return_tensors="pt", add_special_tokens=False).input_ids

# run decoder_input_ids through the BERT decoder and take the last hidden state
lm_logits = bert_model(decoder_input_ids).last_hidden_state

# change the decoder input slightly
decoder_input_ids_perturbed = bert_tokenizer("<pad> Ich will das", return_tensors="pt", add_special_tokens=False).input_ids
lm_logits_perturbed = bert_model(decoder_input_ids_perturbed).last_hidden_state

# compare the output vector of the first token ("Ich") for input_ids and perturbed input_ids
print("Is encoding for `Ich` equal to its perturbed version?: ", torch.allclose(lm_logits[0, 0], lm_logits_perturbed[0, 0], atol=1e-3))

Is encoding for Ich equal to its perturbed version?: False

It is because of dropout. AutoModel.from_config returns a freshly initialized model that is still in training mode, so its dropout layers are active and every forward pass is stochastic. from_pretrained (used for GPT-2 above) puts the model in eval mode by default, which is why the GPT-2 comparison is deterministic.
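A minimal fix is to put the BERT model in eval mode before running the comparison (a sketch, reusing the variables defined above):

# disable dropout so repeated forward passes are deterministic
bert_model.eval()

with torch.no_grad():
    lm_logits = bert_model(decoder_input_ids).last_hidden_state
    lm_logits_perturbed = bert_model(decoder_input_ids_perturbed).last_hidden_state

print("Is encoding for `Ich` equal to its perturbed version?: ",
      torch.allclose(lm_logits[0, 0], lm_logits_perturbed[0, 0], atol=1e-3))

With the causal mask in place, this should now print True: the first token's hidden state depends only on the first token, just as in the GPT-2 case.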
