In a lot of BERT tutorials I see that the input is just the token ID of each word. But surely we need to convert this token ID into a vector representation (it could be a one-hot encoding, or any initial vector representation per token ID) so that it can be used by the model?
My question is: where can I find this initial vector representation for each token? There seems to be no guide on this, hence my question.
The token ID specifically is used in the embedding layer, which you can think of as a matrix whose row indices are all possible token IDs (one row for each item in the vocabulary, for instance 30K rows). Every token therefore has a (learned!) representation. Beware, though, that this is not the same as word2vec or similar approaches - it is context-sensitive and not trained specifically to be used by itself. It only serves as the input of the model, together with potentially other embeddings like type and position embeddings. Getting those embeddings by themselves is not very useful. If you want to get output representations for each word, this post may be helpful: Generate raw word embeddings using transformer models like BERT for downstream process - #2 by BramVanroy
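To make the lookup concrete, here is a minimal pure-Python sketch (all sizes and token IDs here are invented; in a real model this matrix lives in a `torch.nn.Embedding` layer with roughly 30522 x 768 entries for bert-base-uncased, and its rows are learned during training):

```python
import random

random.seed(0)
vocab_size, hidden_size = 10, 4  # toy sizes; bert-base-uncased uses ~30522 x 768

# The embedding layer is just a matrix: one (learned) row per token ID.
embedding_matrix = [[random.uniform(-1.0, 1.0) for _ in range(hidden_size)]
                    for _ in range(vocab_size)]

def embed(token_ids):
    """Plain table lookup: return the row for each token ID."""
    return [embedding_matrix[i] for i in token_ids]

vectors = embed([3, 7, 3])          # made-up token IDs
assert vectors[0] == vectors[2]     # the same ID always yields the same row
print(len(vectors), len(vectors[0]))  # 3 4
```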
Thanks! So it seems like the input is literally the token ID, and it's just like an ordinal encoding scheme where you represent words as IDs. I was confused because other methods use things like bag of words, one-hot encoding, etc. Do you know of any benefits of representing words as IDs? If not, thanks again!
First of all, these are not words but tokens. Most large language models these days use a subword tokenizer to limit the potential size of the vocabulary and to avoid out-of-vocabulary issues.
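As an illustration of subword splitting, here is a toy greedy longest-match-first splitter in the spirit of WordPiece; the vocabulary is invented for the example, and the real algorithm and vocabulary differ:

```python
# Made-up mini vocabulary; "##" marks a piece that continues a word.
vocab = {"eat", "##ing", "bread", "ban", "##ana", "the", "[UNK]"}

def wordpiece(word):
    """Greedily take the longest vocabulary match at each position."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces are prefixed
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched: out-of-vocabulary fallback
        pieces.append(piece)
        start = end
    return pieces

print(wordpiece("eating"))  # ['eat', '##ing']
print(wordpiece("banana"))  # ['ban', '##ana']
```

Because unknown words fall apart into known pieces (or `[UNK]` at worst), the vocabulary can stay small while still covering arbitrary input.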
Using one-hot encoding or bag of words is something completely different, meant for different purposes or architectures. I encourage you to follow the course to better understand how these things work.
Can you elaborate on the point about the embedding vector not being like word2vec, i.e. being context-sensitive? This doesn't sit well with my understanding of BERT.
If I take two different sentences and tokenise them so that the input_ids provide the indices into the matrix that extracts the initial-layer embeddings, then whenever the indices are the same I will extract exactly the same vectors for the same word across different contexts. So, if the same token IDs are generated, there's no way they can be context-sensitive: they are fixed. Being fixed, I can't see how they differ from a context-insensitive representation (such as word2vec).
If I take vblagoje/bert-english-uncased-finetuned-pos and run the following two sentences through the tokeniser, it returns the same index for the word "bread" (also verified as being the case across different homophones).
Sentence 1: I like eating bread
Sentence 2: The banana bread was horrible
The word "bread" is mapped to 7852 in both cases. Now, I know you're right in this as you are the expert, but something is a bit odd in the formulation of what you said, so I hoped you might be able to clarify this point. For me, a fixed mapping to indices across different contexts is a context-insensitive method similar to word2vec. I fully understand it's not intended to ever be used by itself and is only the input to the Transformer model. However, this idea of it being context-sensitive is something that I think needs to be clarified here.
What I meant was that the output of the model for a given word is context-sensitive. I could have phrased that better, indeed. Of course the embedding layer is just a lookup table, but you should never just extract vectors from the embedding layer only - that completely defeats the purpose of context-sensitive models in the first place. So yes, nn.Embedding is just a lookup table. But the output of every token from every layer past the embedding layer is context-dependent.
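A toy way to see that distinction (all vectors are invented, and a simple mean-mixing step stands in for self-attention): the same lookup row goes in for "bread" both times, but the output depends on the surrounding sentence:

```python
# Made-up 2-dimensional "embedding matrix" as a dict for readability.
embedding = {"i": [1.0, 0.0], "like": [0.0, 1.0], "bread": [1.0, 1.0],
             "banana": [0.5, 0.5], "the": [0.2, 0.8]}

def contextualize(tokens):
    """Context-free lookup, then crude mixing with the sentence mean
    (a stand-in for what self-attention layers do)."""
    vectors = [embedding[t] for t in tokens]
    mean = [sum(col) / len(vectors) for col in zip(*vectors)]
    return {t: [(a + b) / 2 for a, b in zip(v, mean)]
            for t, v in zip(tokens, vectors)}

out1 = contextualize(["i", "like", "bread"])
out2 = contextualize(["the", "banana", "bread"])

assert embedding["bread"] == [1.0, 1.0]  # same input row in both sentences
assert out1["bread"] != out2["bread"]    # different outputs per context
```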
That makes perfect sense now and is how I thought of it. Of course, extracting from the lookup table directly defeats the point of the model, but it's a necessary step when pedagogically explaining the flow from inputs to context-sensitive outputs, i.e. what happens to those input ID values and what they represent. Keep up the good work!
The following is my understanding:
(Let's call the "token embedding matrix of the model" the embedding matrix)
1. The number of tokens in the tokenizer is equal to the number of rows in the embedding matrix.
2. Each row is a learned representation of a token.
3. When adding new tokens to the tokenizer, new rows get randomly initialized in the embedding matrix (via the resize_token_embeddings() method).
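A rough pure-Python stand-in for what resize_token_embeddings() does (sizes and initialisation scale are made up here): the learned rows are kept untouched, and new randomly initialised rows are appended:

```python
import random

random.seed(0)
hidden_size = 4
# Pretend these 10 rows were learned during pre-training.
embedding_matrix = [[0.0] * hidden_size for _ in range(10)]

def resize_token_embeddings(matrix, new_vocab_size):
    """Append randomly initialised rows; existing rows stay as they are."""
    while len(matrix) < new_vocab_size:
        matrix.append([random.gauss(0.0, 0.02) for _ in range(hidden_size)])
    return matrix

resize_token_embeddings(embedding_matrix, 12)  # two new tokens added
assert len(embedding_matrix) == 12
assert embedding_matrix[0] == [0.0] * hidden_size  # old rows untouched
```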
The following are my doubts:
1. Given a particular token, what is the relation between its token ID (just a number) and its learned token embedding (the respective row in the embedding matrix)?
2. How was the embedding matrix trained? With a normal feed-forward layer with a single input neuron and multiple output neurons (equal to the number of columns of the embedding matrix), or with some other architecture?
3. Was the embedding matrix trained in parallel during pre-training, or in a step prior to pre-training? Will the complete embedding matrix be trained again during fine-tuning?
4. If I want to add a huge number of vocabulary items to the tokenizer, what is an intelligent way to initialize the new rows of the embedding matrix?
1. The token ID is the row index in the embedding matrix, so every row is a token representation.
2 and 3. In BERT, the embedding matrix is trained alongside the whole model; there are no separate steps involved at all. To better understand, you can read the BERT paper or look up the Illustrated BERT/Transformer.
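As a toy illustration of that joint training (the one-dimensional setup and all numbers are invented for the sketch): a single backward pass produces gradients for both the downstream weight and the embedding row of the token that was looked up, so both are updated in the same step:

```python
import random

random.seed(0)
emb = [[random.uniform(-1, 1)] for _ in range(3)]  # 3 tokens, 1-dim embeddings
w = 0.5                                            # downstream "model" weight
data = [(0, 1.0), (1, -1.0), (2, 1.0)]             # (token_id, target) pairs

lr = 0.1
for _ in range(500):
    for tok, y in data:
        x = emb[tok][0]
        err = w * x - y            # d(loss)/d(pred) for loss = 0.5*(pred - y)^2
        grad_w = err * x           # same backward pass updates both ...
        grad_x = err * w           # ... the downstream weight and the embedding row
        w -= lr * grad_w
        emb[tok][0] -= lr * grad_x

# Predictions should now be close to the targets for every token.
print(w * emb[0][0], w * emb[1][0], w * emb[2][0])
```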
4. I am not sure. By adding new tokens, you will always have the issue that your model is imbalanced: some tokens are trained and some are not. I am not sure what the best way to deal with this is.
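One heuristic that is sometimes suggested (an assumption on my part, not something established in this thread) is to initialise each new row at the mean of the existing rows, so new tokens at least start "in distribution" rather than at an arbitrary random point:

```python
hidden_size = 4
# Pretend these 5 rows are the learned embeddings (values invented).
embedding_matrix = [[float(i), 0.0, 1.0, -1.0] for i in range(5)]

def mean_row(matrix):
    """Column-wise mean of all existing embedding rows."""
    n = len(matrix)
    return [sum(row[j] for row in matrix) / n for j in range(hidden_size)]

new_rows = 2
mean = mean_row(embedding_matrix)
embedding_matrix.extend([list(mean) for _ in range(new_rows)])

assert len(embedding_matrix) == 7
print(embedding_matrix[-1])  # [2.0, 0.0, 1.0, -1.0]
```

Variants of this idea average only the embeddings of the subword pieces that the new token previously split into, which keeps the initialisation closer to the token's old representation.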