Conceptual questions about transformers

olaffson · August 25, 2021, 12:52pm

Hello there,

I am struggling with two simple questions about transformer models. I hope somebody in the Hugging Face community can shed some light on these points.

Transformer models create contextual embeddings. That is, the embedding of a word depends on the words around it in the sentence. Does this mean a word has a different embedding for every possible sentence?
At inference time, I am feeding a sentence to a transformer model for classification. How can the attention mechanism work (this word is related to that word, etc.) if the model has never seen the sentence? I get the intuition at training time (we minimize the loss) but what happens at inference time?

Any insights would be greatly appreciated!
Thanks!

BramVanroy · August 25, 2021, 1:17pm

The final output of a word will differ depending on its position in the sentence as well as the words surrounding it. So, yes - for every different sentence, the word will have a different output.
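This is the whole point of training: the model learns weights that determine, for a given input vector, which words it should pay attention to and which ones it should not. At inference time those weights are fixed and the attention scores are computed afresh from whatever sentence you feed in, so the model does not need to have seen that sentence before. In practice it is more mathematical than that. This illustration may help you get your head around it.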

olaffson · August 25, 2021, 1:42pm

thanks @BramVanroy, that makes sense. I get the mathematical construct (with key, value, and query vectors) but I am trying to find examples where you can see the attention in action with concrete examples (say the real embeddings from a trained model). Most of the explanations online are very generic and give the intuition but sometimes you need a bit more… Perhaps you have some readings to suggest?

Thanks!

olaffson · August 25, 2021, 2:29pm

actually @BramVanroy something that I find hard to find is some references on transformers for text-classification (rather than translation)

lewtun · August 25, 2021, 5:33pm

hey @olaffson in case you don’t know about it, there’s a nifty tool called BertViz that provides some great examples on visualising attention (and you can interact with it!)

olaffson · August 25, 2021, 5:34pm

Interestingly, by reading the papers closely I think the following is true: there is an embedding for each word in the model that is independent of the given sentence.

What the transformer architecture does, however, is to create, for each word, a context-specific embedding that is essentially a weighted average of the embeddings of all the words in the sentence.

The key question is: @BramVanroy can we recover these two embeddings from huggingface? The context independent and the more useful context dependent one? Does that make sense?

thanks!

olaffson · August 25, 2021, 5:35pm

Thanks @lewtun I looked at the tool. I think my reply above provides some additional input. I would be curious to have your 2 cents too! Thanks for your time again

lewtun · August 25, 2021, 5:38pm

yes this is correct, although to be precise it’s for each token (which can be a word or subword)

i don’t understand this - which papers and where in them does this come from?

olaffson · August 25, 2021, 5:44pm

@lewtun my understanding is the following: consider the sentence "a happy dog".

The "happy" token has an initial context-independent embedding. The self-attention mechanism transforms it into another, context-dependent embedding that is essentially a linear combination such as: a*embedding('a') + b*embedding('happy') + c*embedding('dog'), where a, b, c are the attention weights.

My question is whether the initial embeddings (likely randomly initialized) are accessible, and whether they are optimized at all during training?
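
To make the weighted-sum picture concrete, here is a minimal pure-Python sketch of single-head self-attention on "a happy dog". Everything numeric in it is an assumption for illustration: the 2-dimensional embeddings and the matrices W_q, W_k and W_v are made up, not taken from any trained model. One refinement to the formula above: in an actual transformer the attention weights are applied to value projections of the token vectors rather than to the raw embeddings, and many heads and layers are stacked.

```python
import math

# Toy 2-dimensional "static" embeddings, made up for illustration only.
embeddings = {
    "a": [1.0, 0.0],
    "happy": [0.0, 1.0],
    "dog": [1.0, 1.0],
}
tokens = ["a", "happy", "dog"]

# Hypothetical projection matrices; a trained model learns these and then
# keeps them fixed at inference time.
W_q = [[1.0, 0.0], [0.0, 1.0]]
W_k = [[0.5, 0.0], [0.0, 2.0]]
W_v = [[1.0, 1.0], [0.0, 1.0]]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(position, sentence):
    """Return (attention weights, contextual vector) for one token."""
    xs = [embeddings[t] for t in sentence]
    query = matvec(W_q, xs[position])
    keys = [matvec(W_k, x) for x in xs]
    values = [matvec(W_v, x) for x in xs]
    scale = math.sqrt(len(query))
    # Scaled dot product between this token's query and every key.
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # Contextual vector: weighted sum of the value vectors.
    context = [
        sum(w * v[i] for w, v in zip(weights, values))
        for i in range(len(query))
    ]
    return weights, context


weights, happy_in_context = attend(1, tokens)
print(dict(zip(tokens, weights)))  # the a, b, c from the formula above
print(happy_in_context)
```

The weights are computed from the input itself using the fixed matrices, so nothing about the sentence needs to have been seen before. Calling attend(1, ["a", "happy"]) gives different weights and a different vector for "happy", which is the sense in which the embedding is contextual.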

BramVanroy · August 25, 2021, 6:31pm

I think it would help you to get your hands dirty with some of these models to see how they work or what they give as output. What you are describing is technically correct, but often misused.

Let’s say you load “bert-base-cased”. It will have 13 “layers”. At the very beginning you have an Embedding layer, as you mentioned. The only thing that this does is look up the (learned) representation of a token (index) during inference. During training the weights (vector values) for each token are indeed updated/trained. This is indeed static - but it is different from things like word2vec or other static word representations because it is trained as part of a whole system. It is not trained in such a way that it is intended to be used as a static vector. These token-level representations then serve as the input for the actual model (the next 12 layers), which are attention-based layers.

So you are right in saying that you start from a static token embedding for each token (often augmented with position encoding) and only later does the model use attention to get contextualized embeddings. But it is a mistake to think that those static embeddings are to be used like word2vec.

Here are some things to try:

from transformers import AutoModel, AutoTokenizer
# We want the output of ALL layers not just the last one, so output_hidden_states=True
model = AutoModel.from_pretrained("bert-base-cased", output_hidden_states=True)
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# We need to convert our text in subword units and assign indices to each token
# We need tensors because our model expects that (pt = PyTorch)
encoded = tokenizer("I like you .", return_tensors="pt")

print(encoded)
# {'input_ids': tensor([[ 101,  146, 1176, 1128,  119,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1]])}
# What is important here are the input ids. These correspond to the IDs of the
# subword units produced by the tokenizer (with special tokens added where necessary).
# When given to the model, the model will look up e.g. index 101 in the Embedding layer.
# So the Embedding layer is just a look-up of your subword IDs.

# Forward pass through the model
output = model(**encoded)

print(output.keys())
# dict_keys(['last_hidden_state', 'pooler_output', 'hidden_states'])
# We get back the hidden state of the last layer, a final "pooler" output e.g. for classification, and ALL hidden states

print(len(output.hidden_states))
# 13
# We can confirm that we indeed get back the output of the 13 layers

print(output.hidden_states[0])
# ...
# This will print the output of the Embedding layer (the first, static layer)

EDIT: there is a small flaw in what I wrote above, although I touched upon it. That first of the 13 outputs is indeed the output of the first “layer” (i.e. the embedding), but in the case of BERT this does NOT consist only of a static Embedding layer. The output is the sum of the aforementioned static vectors + position embedding + type embedding. The type embedding is less important here, but the position embedding matters. What this means is that the first-layer output CAN differ for the same token because its position embedding differs.
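
A minimal pure-Python sketch of that sum may help. The three lookup tables below are hypothetical stand-ins with made-up 3-dimensional values, not weights from bert-base-cased; only the token ids 146 and 1176 are borrowed from the tokenizer output above. The sketch also leaves out the LayerNorm and dropout that BERT applies after the sum.

```python
# Made-up lookup tables standing in for learned weights (hypothetical values).
word_embeddings = {146: [0.2, -0.1, 0.4], 1176: [0.7, 0.3, -0.5]}
position_embeddings = [[0.0, 0.1, 0.0], [0.1, 0.0, -0.1], [-0.2, 0.2, 0.1]]
token_type_embeddings = [[0.05, 0.05, 0.05], [-0.05, -0.05, -0.05]]


def embed(input_ids, token_type_ids):
    """Sum word, position and token type vectors for each input position."""
    output = []
    pairs = zip(input_ids, token_type_ids)
    for position, (token_id, type_id) in enumerate(pairs):
        word = word_embeddings[token_id]        # static, per-token look-up
        pos = position_embeddings[position]     # depends only on the position
        typ = token_type_embeddings[type_id]    # depends only on the segment
        output.append([w + p + t for w, p, t in zip(word, pos, typ)])
    return output


# The same token id (146) appears at positions 0 and 2.
layer0 = embed([146, 1176, 146], [0, 0, 0])
print(word_embeddings[146])  # static vector, identical for both occurrences
print(layer0[0])             # first occurrence after the sum
print(layer0[2])             # second occurrence: differs because of position
```

The static vector for id 146 is the same in both places, while the summed outputs at positions 0 and 2 differ, which matches the behaviour described in the EDIT.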

Without hacking the library a bit, I do not think it is possible out of the box to only get the static vectors without the other summed embeddings. See the implementation of the BertEmbeddings layer for more:

github.com

huggingface/transformers/blob/72eefb34a9f24f834a8a855ab6e0ed1cc7568af8/src/transformers/models/bert/modeling_bert.py#L167

class BertEmbeddings(nn.Module):
    """Construct the embeddings from word, position and token_type embeddings."""

    def __init__(self, config):
        super().__init__()
        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)

        # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load
        # any TensorFlow checkpoint file

olaffson · August 26, 2021, 2:53pm

fascinating. Thanks @BramVanroy this is super useful!
