The inputs into BERT are token IDs. How do we get the corresponding input token VECTORS?

The token ID is used in the embedding layer, which you can think of as a matrix whose row indices are all possible token IDs (so one row for each item in the vocabulary, e.g. 30K rows). Every token therefore has a (learned!) representation. Beware, though, that this is not the same as word2vec or similar approaches: BERT is context-sensitive, and the embedding layer is not trained specifically to be used by itself. It only serves as the input to the model, together with potentially other embeddings such as the token type and position embeddings. Getting those embeddings by themselves is not very useful. If you want to get output representations for each word, this post may be helpful: Generate raw word embeddings using transformer models like BERT for downstream process - #2 by BramVanroy
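
As a minimal sketch of what that lookup looks like in practice (assuming the `bert-base-uncased` checkpoint and the Hugging Face `transformers` library; the variable names are just illustrative):

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# The embedding layer is a lookup table of shape (vocab_size, hidden_size),
# e.g. roughly (30522, 768) for bert-base-uncased
embedding_layer = model.get_input_embeddings()

# Token IDs simply index rows of that matrix
inputs = tokenizer("Hello world", return_tensors="pt")
token_vectors = embedding_layer(inputs["input_ids"])
print(token_vectors.shape)  # (1, seq_len, 768)
```

Note that these are only the raw input vectors; the context-sensitive representations come out of the encoder layers on top (e.g. `model(**inputs).last_hidden_state`).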
