The inputs into BERT are token IDs. How do we get the corresponding input token VECTORS?

Hi, I am new and learning about transformers.

In a lot of BERT tutorials I see that the input is just the token IDs of the words. But surely we need to convert each token ID to a vector representation (a one-hot encoding, or any initial vector representation for each token ID) so that it can be used by the model?

My question is: where can I find this initial vector representation for each token? There seems to be no guide on this, which is why I am asking.

The token ID is used specifically in the embedding layer, which you can see as a matrix whose row indices are all possible token IDs (so one row for each item in the vocabulary, for instance 30K rows). Every token therefore has a (learned!) representation. Beware, though, that this is not the same as word2vec or similar approaches: it is context-sensitive and not trained specifically to be used by itself. It only serves as the input of the model, together with potentially other embeddings like type and position embeddings. Getting those embeddings by themselves is not very useful. If you want to get output representations for each word, this post may be helpful: Generate raw word embeddings using transformer models like BERT for downstream process - #2 by BramVanroy
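To make the "matrix indexed by token ID" picture concrete, here is a minimal PyTorch sketch. The 30,000 rows follow the figure mentioned above; the hidden size of 8, the random weights, and the three token IDs are assumptions chosen for illustration, not values taken from a real BERT checkpoint.

```python
import torch
from torch import nn

torch.manual_seed(0)

VOCAB_SIZE = 30_000  # one row per token in the vocabulary (figure from the post above)
HIDDEN_SIZE = 8      # assumption: tiny for readability; BERT-base uses 768

# A randomly initialised stand-in for the token embedding layer.
embedding = nn.Embedding(VOCAB_SIZE, HIDDEN_SIZE)

token_ids = torch.tensor([101, 7852, 102])  # illustrative token IDs

# Calling the layer looks up one row per token ID ...
vectors = embedding(token_ids)

# ... which is the same as indexing the weight matrix directly.
rows = embedding.weight[token_ids]

print(vectors.shape)               # torch.Size([3, 8])
print(torch.equal(vectors, rows))  # True
```

In other words, the token ID never enters the model as a number to compute with; it only selects a row.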

Thanks! So it seems the input is literally the token ID, and it's just like an ordinal encoding scheme where you represent words as IDs. I was confused by this because other methods use things like bag of words, one-hot encoding, etc. Do you know of any benefits of representing words as IDs? If not, thanks again!

First of all, not words but tokens. Most large language models these days use a subword tokenizer to limit the potential size of the vocabulary and avoid out-of-vocabulary issues.

Using one hot encoding or bag of words is something completely different, for different purposes or architectures. I encourage you to follow the course to better understand how these things work.
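One connection may still be useful to see: looking up row i of an embedding matrix gives the same result as multiplying a one-hot vector for token i by that matrix, so the ID can be read as a compact way of storing the one-hot index. A dependency-free sketch, using a made-up 4-token vocabulary and invented numbers:

```python
# Toy embedding matrix: 4 tokens, 3 dimensions each (all numbers invented).
embedding_matrix = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]


def one_hot(token_id, vocab_size):
    """Return a list with 1.0 at position token_id and 0.0 elsewhere."""
    return [1.0 if i == token_id else 0.0 for i in range(vocab_size)]


def vector_times_matrix(vector, matrix):
    """Multiply a row vector by a matrix (plain Python, no libraries)."""
    n_cols = len(matrix[0])
    return [
        sum(vector[r] * matrix[r][c] for r in range(len(matrix)))
        for c in range(n_cols)
    ]


token_id = 2
via_one_hot = vector_times_matrix(one_hot(token_id, 4), embedding_matrix)
via_lookup = embedding_matrix[token_id]

print(via_one_hot == via_lookup)  # True
```

The lookup avoids building a vocabulary-sized vector for every token, which is presumably why libraries take IDs as input rather than one-hot vectors.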

Can you elaborate on the point about the embedding vector not being like word2vec, i.e. being context sensitive? This doesn’t sit well with my understanding of BERT.

If I take two different sentences and tokenise them such that the input_ids provide the index into the matrix from which the initial-layer embeddings are extracted, then wherever the indices are the same, I will extract the exact same vectors for the same word across different contexts. So, if the same token IDs are generated, there's no way the vectors can be context-sensitive: they are fixed. Since they are fixed, I can't see how they differ from a context-insensitive representation (such as word2vec).

If I take vblagoje/bert-english-uncased-finetuned-pos and run the following two sentences through the tokeniser, it returns the same index for the word ‘bread’ (also verified as being the case across different homophones).

Sentence 1: I like eating bread
Sentence 2: The banana bread was horrible

{'input_ids': [[101, 1045, 2066, 5983, 7852, 102], [101, 1996, 15212, 7852, 2001, 9202, 102]] ...

The word ‘bread’ is mapped to 7852 in both cases. Now, I know you’re right in this as you are the expert, but something is a bit odd in the formulation of what you said, so I just hoped you might be able to clarify this point. For me, fixed mapping to indices across different contexts represents a context-insensitive method similar to word2vec. I fully understand it’s not intended to ever be used by itself and is only the input to the Transformer model. However, this idea of it being context-sensitive is something that I think needs to be clarified here.

What I meant was that the output of the model for a given word is context-sensitive. I could have phrased that better, indeed. Of course the embedding layer is just a lookup table, but you should never just extract vectors from the embedding layer only - that completely defeats the purpose of context-sensitive models in the first place. So yes, nn.Embedding is just a lookup table. But the output of every token from every layer past the embedding layer is context-dependent.
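The distinction can be checked with a toy model. The sketch below is not BERT: it assumes a small random `nn.Embedding`, a single generic `nn.TransformerEncoderLayer` in place of BERT's stack, no position or type embeddings, and a hypothetical token ID `BREAD`. It only illustrates that the lookup is context-free while anything after an attention layer is not.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy stand-ins (assumptions): small vocabulary and hidden size, random weights,
# one generic encoder layer instead of BERT's full stack, no position embeddings.
embedding = nn.Embedding(100, 16)
encoder = nn.TransformerEncoderLayer(
    d_model=16, nhead=2, dim_feedforward=32, batch_first=True
)
encoder.eval()  # switch off dropout so the run is deterministic

BREAD = 7  # hypothetical ID for the token shared by both sentences
sentence_a = torch.tensor([[1, 2, 3, BREAD]])
sentence_b = torch.tensor([[4, 5, BREAD, 6]])

with torch.no_grad():
    inputs_a, inputs_b = embedding(sentence_a), embedding(sentence_b)
    outputs_a, outputs_b = encoder(inputs_a), encoder(inputs_b)

# Input vectors for the shared token: compared exactly, since it is a plain lookup.
same_input = torch.equal(inputs_a[0, 3], inputs_b[0, 2])
# Output vectors for the shared token: compared after attention has mixed in context.
same_output = torch.allclose(outputs_a[0, 3], outputs_b[0, 2])

print(same_input, same_output)
```

With this setup the first value should come out True and the second False: the same ID gives the same input vector, but the encoder output for that position depends on the neighbouring tokens.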

Thanks Bram,

That makes perfect sense now and is how I thought of it. Of course, extracting from the lookup table directly defeats the point of the model, but it's a necessary step when pedagogically explaining the flow from inputs to context-sensitive outputs, i.e. what happens to those input ID values and what they represent. Keep up the good work!

Hi BramVanroy,

The following is my understanding:
(Let’s call the model's token embedding matrix the “embedding matrix”.)

1. The number of tokens in the tokenizer is equal to the number of rows in the embedding matrix.
2. Each row is a learned representation of a token.
3. When new tokens are added to the tokenizer, new rows are randomly initialized in the embedding matrix (by using the resize_token_embeddings() method).

The following are my doubts:
1. Given a particular token, what is the relation between its token ID (just a number) and its learned token embedding (the respective row in the embedding matrix)?
2. How was the embedding matrix trained? Using a normal feed-forward layer with a single input neuron and multiple output neurons (equal to the number of columns of the embedding matrix), or some other architecture?
3. Was the embedding matrix trained in parallel during pre-training, or as a step prior to pre-training? Will the complete embedding matrix be trained again during fine-tuning?
4. If I want to add a large number of new tokens to the tokenizer's vocabulary, what is an intelligent way to initialize the new rows of the embedding matrix?

Kindly correct me if my understanding is wrong.

  1. The token ID is the row index in the embedding matrix, so every row is a token representation.
  2. and 3. In BERT, the embedding is trained alongside the whole model. There are no separate steps involved at all. To better understand, you can read the BERT paper or look up the illustrated BERT/transformer.
  4. I am not sure. By adding new tokens, you will always have the issue that your model is imbalanced: some tokens are trained and some are not. I am not sure what the best way is to deal with this.
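On the question of initializing new rows, one heuristic that is sometimes used (it is not something this thread settles on, and it is not guaranteed to be the best choice) is to start each new row at the mean of the existing trained rows instead of at random values. A minimal PyTorch sketch, assuming a random `nn.Embedding` as a stand-in for a trained one and made-up sizes:

```python
import torch
from torch import nn

torch.manual_seed(0)

OLD_VOCAB, NUM_NEW, HIDDEN = 100, 5, 16  # assumption: toy sizes

# Stand-in for an already trained embedding layer.
old_embedding = nn.Embedding(OLD_VOCAB, HIDDEN)

# Larger layer that will hold the old rows plus the new tokens.
new_embedding = nn.Embedding(OLD_VOCAB + NUM_NEW, HIDDEN)

with torch.no_grad():
    # Keep the trained rows unchanged, so existing token IDs map to the same vectors.
    new_embedding.weight[:OLD_VOCAB] = old_embedding.weight
    # Heuristic: start every new row at the mean of the existing rows.
    new_embedding.weight[OLD_VOCAB:] = old_embedding.weight.mean(dim=0)

print(new_embedding.weight.shape)  # torch.Size([105, 16])
```

With a Hugging Face model, resize_token_embeddings() would create the extra rows, and the new rows could then be overwritten in the same way. The new rows still have to be trained on data that contains the new tokens, so the imbalance mentioned above does not go away.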

Thank you Bram for your reply.

Answer to 2) How was the embedding matrix trained?

The embedding matrix is trained using an nn.Embedding layer (PyTorch makes it possible to have a trainable lookup table).
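A small sketch of what "trainable lookup table" means in practice: the table is an ordinary parameter, and backpropagation updates only the rows that were actually looked up. The vocabulary size, the token IDs, and the sum used as a loss are placeholders for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)

embedding = nn.Embedding(10, 4)        # assumption: toy vocabulary of 10 tokens
print(embedding.weight.requires_grad)  # True: the table is a trainable parameter

token_ids = torch.tensor([2, 5])
loss = embedding(token_ids).sum()      # placeholder loss, only for illustration
loss.backward()

# Only the rows that were looked up receive a non-zero gradient.
row_grad_norms = embedding.weight.grad.abs().sum(dim=1)
rows_with_grad = row_grad_norms.nonzero().flatten().tolist()
print(rows_with_grad)                  # [2, 5]
```

So there is no separate network with "a single input neuron": the token ID selects a row, and the gradient of the model's loss flows back into that row.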