Conceptual questions about transformers

Hello there,

I am struggling with two simple questions about transformer models. I hope somebody in the Hugging Face community can shed some light on these points.

  1. Transformer models create contextual embeddings. That is, the embedding of a word depends on the words around it in the sentence. Does this mean a word has a different embedding for every possible sentence?
  2. At inference time, I am feeding a sentence to a transformer model for classification. How can the attention mechanism work (this word is related to this word, etc) if the model has never seen the sentence? I get the intuition at training time (we minimize the loss) but what happens at inference time?

Any insights would be greatly appreciated!

  1. The final output of a word will differ depending on its position in the sentence as well as the words surrounding it. So, yes - for every different sentence, the word will have a different output.
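  2. This is the whole point of training: the model learns which words it should pay attention to and which ones it should not for a given input vector. At inference time those learned weights are fixed, and the attention scores for a new sentence are computed from them on the fly, so the model does not need to have seen that sentence before. In practice it is more mathematical than that. This illustration may help you get your head around it.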
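
To make both answers a bit more concrete, here is a minimal sketch in plain Python. It is not how any real model is implemented: the three-dimensional vectors in `STATIC` are made up, and the learned query/key/value projections are replaced by the identity to keep it short. It only tries to show that with fixed parameters the same word ends up with a different output vector in a different sentence, and that nothing has to be stored per sentence.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def self_attention(vectors):
    """Single-head self-attention with identity projections (an assumption:
    a trained model would first apply learned query/key/value matrices)."""
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        scores = [dot(query, key) / math.sqrt(dim) for key in vectors]
        weights = softmax(scores)
        # Each output is a weighted average of all vectors in the sentence
        mixed = [sum(w * vec[i] for w, vec in zip(weights, vectors)) for i in range(dim)]
        outputs.append(mixed)
    return outputs


# Hypothetical static embeddings, fixed "after training"
STATIC = {
    "river": [1.0, 0.0, 0.0],
    "money": [0.0, 1.0, 0.0],
    "bank": [0.5, 0.5, 1.0],
}


def contextual(sentence, word):
    """Look up static vectors, run attention, return the vector for `word`."""
    vectors = [STATIC[w] for w in sentence]
    return self_attention(vectors)[sentence.index(word)]


bank_river = contextual(["river", "bank"], "bank")
bank_money = contextual(["money", "bank"], "bank")
print(bank_river)
print(bank_money)
```

The static entry for "bank" is identical in both calls, yet the two printed vectors differ, because the attention weights are recomputed from whatever sentence is passed in.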

thanks @BramVanroy, that makes sense. I get the mathematical construct (with key, value, and query vectors) but I am trying to find cases where you can see the attention in action with concrete examples (say the real embeddings from a trained model). Most of the explanations online are very generic and give the intuition but sometimes you need a bit more… Perhaps you have some readings to suggest?


actually @BramVanroy, something that I find hard to find is references on transformers for text classification (rather than translation)

hey @olaffson in case you don’t know about it, there’s a nifty tool called BertViz that provides some great examples on visualising attention (and you can interact with it!)


Interestingly, by reading the papers closely I think the following is true: there is an embedding for each word in the model that is independent of the given sentence.

What the transformer architecture does, however, is - for each word - to create a context-specific embedding that is essentially a weighted average of the embeddings of all the words in the sentence.

The key question is: @BramVanroy can we recover these two embeddings from huggingface? The context independent and the more useful context dependent one? Does that make sense?


Thanks @lewtun, I looked at the tool. I think my reply above provides some additional input. I would be curious to have your two cents too! Thanks for your time again

yes this is correct, although to be precise it’s for each token (which can be a word or subword) :slight_smile:

i don’t understand this - which papers and where in them does this come from?

@lewtun my understanding is the following: consider the sentence a happy dog.

The happy token has an initial context-independent embedding that is transformed into another, context dependent, embedding via the self-attention mechanism that is essentially a linear combination such as: a*embedding('a') + b*embedding('happy') + c*embedding('dog') where a, b, c are the attention weights.
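
For reference, this linear combination can be written a bit more precisely as standard scaled dot-product attention for a single head. The symbols are assumptions not used elsewhere in this thread: $x_j$ is the vector entering the layer for token $j$, $W_Q$, $W_K$, $W_V$ are the learned projection matrices, and $d_k$ is the key dimension.

```latex
\begin{aligned}
q_i &= W_Q x_i, \quad k_j = W_K x_j, \quad v_j = W_V x_j \\
\alpha_{ij} &= \frac{\exp\left(q_i \cdot k_j / \sqrt{d_k}\right)}{\sum_{m} \exp\left(q_i \cdot k_m / \sqrt{d_k}\right)} \\
z_{\text{happy}} &= \alpha_{\text{happy},\text{a}}\, v_{\text{a}} + \alpha_{\text{happy},\text{happy}}\, v_{\text{happy}} + \alpha_{\text{happy},\text{dog}}\, v_{\text{dog}}
\end{aligned}
```

So the weights a, b, c correspond to $\alpha_{\text{happy},j}$, but the vectors being averaged are the value projections $v_j$ rather than the raw embeddings, and this happens per head and per layer, with residual connections and feed-forward layers in between.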

My point was to ask whether the initial embeddings (likely randomly initialized) are accessible, and whether they are optimized at all during training?

I think it would help you to get your hands dirty with some of these models to see how they work or what they give as output. What you are describing is technically correct, but often misused.

Let’s say you load “bert-base-cased”. It will have 13 “layers”. At the very beginning you have an Embedding layer, as you mentioned. The only thing that this does is look up the (learned) representation of a token (index) during inference. During training the weights (vector values) for each token are indeed updated/trained. This is indeed static - but it is different from things like word2vec or other static word representations because it is trained as part of a whole system. It is not trained in such a way that it is intended to be used as a static vector. These token-level representations then serve as the input for the actual model (the next 12 layers), which are attention-based layers.

So you are right in saying that you start from a static token embedding for each token (often augmented with position encoding) and only later does the model use attention to get contextualized embeddings. But it is a mistake to think that those static embeddings are to be used like word2vec.

Here are some things to try

from transformers import AutoModel, AutoTokenizer

# We want the output of ALL layers, not just the last one, so output_hidden_states=True
model = AutoModel.from_pretrained("bert-base-cased", output_hidden_states=True)
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# We need to convert our text into subword units and assign an index to each token
# We need tensors because our model expects that (pt = PyTorch)
encoded = tokenizer("I like you .", return_tensors="pt")
print(encoded)
# {'input_ids': tensor([[ 101,  146, 1176, 1128,  119,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1]])}
# What is important here are the input ids. These correspond with the IDs for the
# subword units the tokenizer got (and added special tokens where necessary).
# When given to the model, the model will look for e.g. index 101 in the Embedding layer.
# So the Embedding layer is just a look-up of your subword IDs

# Forward pass through the model
output = model(**encoded)
print(output.keys())
# dict_keys(['last_hidden_state', 'pooler_output', 'hidden_states'])
# We get back the hidden state of the last layer, a final "pooler" output e.g. for classification, and ALL hidden states

print(len(output.hidden_states))
# 13
# We can confirm that we indeed get back the output of the 13 layers

print(output.hidden_states[0])
# ...
# This will print the output of the Embedding layer (the first, static layer)

EDIT: there is a small flaw in what I wrote above, although I touched upon it. The first of the 13 outputs is indeed the output of the first “layer” (i.e. the embedding) - but in the case of BERT this does NOT consist only of a static Embedding layer. The output is the sum of the aforementioned static vectors + position embedding + type embedding. The latter is less important, but the position embedding matters. What this means is that the first-layer output CAN differ for the same token, because its positional encoding differs.

Without hacking the library a bit, I do not think it is possible out of the box to only get the static vectors without the other summed embeddings. See the implementation of the BertEmbeddings layer for more.
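
That said, one approach that may work without modifying the library is to call the word embedding module directly through `get_input_embeddings()`, which as far as I know returns the token lookup table on its own. The sketch below is only an illustration under assumptions: it uses a tiny, randomly initialised `BertConfig` (made-up sizes, hypothetical token ids) so that it runs without downloading anything. With a real checkpoint you would load the model with `from_pretrained` as in the snippet above instead.

```python
import torch
from transformers import BertConfig, BertModel

torch.manual_seed(0)

# Assumption: a tiny random BERT stands in for a pretrained checkpoint
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=4,
    intermediate_size=64,
    max_position_embeddings=16,
)
model = BertModel(config)
model.eval()  # switch off dropout so outputs are deterministic

# Hypothetical token ids; id 7 appears twice, at positions 1 and 3
input_ids = torch.tensor([[1, 7, 8, 7, 2]])

with torch.no_grad():
    # Static vectors: a plain look-up in the word embedding matrix
    static = model.get_input_embeddings()(input_ids)
    output = model(input_ids=input_ids, output_hidden_states=True)

# First entry of hidden_states: word + position + type embeddings (then LayerNorm)
embedding_output = output.hidden_states[0]

same_static = torch.equal(static[0, 1], static[0, 3])
same_summed = torch.allclose(embedding_output[0, 1], embedding_output[0, 3])
print(same_static)  # True: the look-up ignores position
print(same_summed)  # False: the position embeddings differ
```

The repeated token gets the same static vector at both positions, while its entry in `hidden_states[0]` differs, which matches the point of the EDIT above.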


fascinating. Thanks @BramVanroy this is super useful!
