Extracting token embeddings from pretrained language models

You can also extract the token embeddings from any open LLM, since they can be read from the model's hidden layers at output.
First you export the vocabulary, then send each vocabulary token through the model to get its respective embedding, keeping them all to save at the end (as in the snippet below).
I personally did this for my models too. As you will notice, the embeddings are tokenizer-related: the number of tokens in the vocabulary determines the number of embeddings you get back, and the embedding size matches the hidden width of the model.
So if you need to use these embeddings later, separately from the model, you also need the associated tokenizer to tokenize your document first and present those tokens to your extracted table (roughly as sketched below).
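For example, reusing the extracted embeddings on a new document might look something like this (a rough sketch, where embeddings_by_id is a hypothetical dict of token id -> vector that you saved during extraction, and tokenizer is the matching tokenizer):

  # Rough sketch: reuse the extracted embeddings on a new document.
  # embeddings_by_id is assumed to be a dict you saved earlier, mapping token id -> vector,
  # and tokenizer must be the SAME tokenizer the embeddings were extracted with.
  doc = "Your new document goes here."
  token_ids = tokenizer(doc, add_special_tokens=False)["input_ids"]
  doc_vectors = [embeddings_by_id[i] for i in token_ids]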
As you know, these are open-source models, so a disclaimer: the extracted embeddings may not be the same as the professional embedding endpoints offered by the model's origin (e.g. the Mistral API). They may even differ model to model (e.g. an 8x7B vs a 13B may have different embeddings), and they may change again after fine-tuning, so embeddings can also be volatile!


  import torch  # tokenizer and model are assumed to be already loaded (e.g. via transformers)

  # Get the vocabulary tokens
  vocab_tokens = tokenizer.get_vocab().keys()

  # Convert vocabulary tokens to a list
  vocab_tokens_list = list(vocab_tokens)

  # Get the embedding for each vocabulary token from the model's hidden states
  # (past_key_values is the attention KV cache, not the embeddings)
  embeddings_list = []
  for token in vocab_tokens_list:
      tokens = tokenizer(token, return_tensors="pt", add_special_tokens=False)
      with torch.no_grad():
          outputs = model(**tokens, output_hidden_states=True)
      # Average over the sequence dimension in case the token string splits into pieces
      embeddings = outputs.hidden_states[-1].mean(dim=1).squeeze().tolist()
      embeddings_list.append(embeddings)


Something like that…
It may take an hour or so: there are 32,000 tokens in the vocabulary of a base Mistral, and some Llama variants have even more. A faster alternative is sketched below.
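A much quicker option (a minimal sketch, assuming a Hugging Face transformers causal LM is already loaded as model) is to read the static input-embedding table directly instead of running every vocabulary token through a forward pass:

  # The input embedding table is a single weight matrix of shape (vocab_size, hidden_size)
  embedding_matrix = model.get_input_embeddings().weight.detach().cpu()
  print(embedding_matrix.shape)  # e.g. torch.Size([32000, 4096]) for a base Mistral 7B

Note these are the static input embeddings, not the contextual ones from the last hidden layer, so the two approaches give different tables.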
The question is: which is better? More tokens or fewer? Word tokens, sentence tokens, word-gram tokens, or BPE?
If initializing a model from scratch, it would be prudent to begin with a custom tokenizer: you could use your personalized vocabulary and corpus to train the tokenizer first, before attaching it to your newly instantiated model, so that when you train the model you are training against your own tokenizer (e.g. a multilingual one). Mistral uses the Llama tokenizer, so why not the better-trained BERT ones? Is there something to gain from custom tokenizers, or should they all converge to the same thing in the end, especially with BPE? And where are the embeddings, given that we use the tokenizer only to produce token IDs? If those token embeddings have already been given some kind of boost through training, so that the tokenizer's output is effectively already meaningful, then the model should essentially have two layers of embeddings? (A tokenizer-training sketch follows below.)
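As a rough idea of what training a custom BPE tokenizer from scratch could look like (a sketch using the Hugging Face tokenizers library; corpus.txt is a placeholder path for your own corpus):

  from tokenizers import Tokenizer, models, pre_tokenizers, trainers

  # Build a byte-pair-encoding tokenizer and train it on your own corpus
  tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
  tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
  trainer = trainers.BpeTrainer(vocab_size=32000, special_tokens=["[UNK]", "<s>", "</s>"])
  tokenizer.train(files=["corpus.txt"], trainer=trainer)  # corpus.txt is a placeholder
  tokenizer.save("my_tokenizer.json")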
Since embeddings are essentially token-to-vector tables, the final layer of the model is in effect the last embedding table in the model, because at each layer the embeddings take on a new shape; hence taking the last hidden layer rather than the input embedding layer?
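One way to see both "layers of embeddings" at once (a minimal sketch, reusing the model and tokenizer already loaded above) is to request all hidden states for a sentence: index 0 is the raw input-embedding layer, and index -1 is the contextual output of the last layer:

  text = "Extracting embeddings from a pretrained model"
  inputs = tokenizer(text, return_tensors="pt")
  with torch.no_grad():
      out = model(**inputs, output_hidden_states=True)

  input_embeddings = out.hidden_states[0]   # static vectors straight from the lookup table
  last_hidden = out.hidden_states[-1]       # contextual vectors after every transformer layer
  print(input_embeddings.shape, last_hidden.shape)  # both are (1, seq_len, hidden_size)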

Not sure if I'm confused or not here?
