What is an embedding?

I am confused about how to determine the best embedding for a given entity from an LLM. If I run:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, trust_remote_code=True, device_map="auto")
outputs = model(**inputs, output_hidden_states=True)  # `inputs` produced by the matching tokenizer
hidden_states = outputs.hidden_states                 # tuple: embedding output + one tensor per layer

For Llama2 and Mistral, there are 33 hidden states, each of shape [batch_size, number_of_tokens, embedding_size]. Ultimately, when I think of “an” embedding, I am expecting something with the shape [batch_size, embedding_size] (or just [1, embedding_size]). How do I do this conversion?
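
For completeness, `inputs` above comes from the matching tokenizer, roughly like this (a sketch; the text is just a placeholder), and these are the shapes I see:

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
inputs = tokenizer("some entity name", return_tensors="pt").to(model.device)

print(len(hidden_states))       # 33 for Llama2 / Mistral
print(hidden_states[-1].shape)  # [batch_size, number_of_tokens, embedding_size]
# what I actually want: a single tensor of shape [batch_size, embedding_size]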

  1. Averaging and Pooling

The function get_pooling from here seems to suggest that there could be several ways of doing this. Here is a code snippet:

def get_pooling(outputs, inputs, pooling_strategy, padding_strategy):
    """
    :param outputs:  torch.Tensor. Model outputs (without pooling)
    :param inputs:  Dict. Model inputs
    :param pooling_strategy:  str. Pooling strategy ['cls', 'cls_avg', 'cls_max', 'last', 'avg', 'max', 'all', index]
    :param padding_strategy:  str. Padding strategy of tokenizers (`left` or `right`).
        It can be obtained by `tokenizer.padding_side`.
    """
    if pooling_strategy == 'cls':
        # hidden state of the first token
        outputs = outputs[:, 0]
    elif pooling_strategy == 'cls_avg':
        # mean of the first token's state and a masked average over all tokens
        avg = torch.sum(
            outputs * inputs["attention_mask"][:, :, None], dim=1) / torch.sum(inputs["attention_mask"])
        outputs = (outputs[:, 0] + avg) / 2.0
    elif pooling_strategy == 'cls_max':
        # mean of the first token's state and a masked element-wise max over tokens
        maximum, _ = torch.max(outputs * inputs["attention_mask"][:, :, None], dim=1)
        outputs = (outputs[:, 0] + maximum) / 2.0
    elif pooling_strategy == 'last':
        # hidden state of the last non-padding token in each sequence
        batch_size = inputs['input_ids'].shape[0]
        sequence_lengths = -1 if padding_strategy == 'left' else inputs["attention_mask"].sum(dim=1) - 1
        outputs = outputs[torch.arange(batch_size, device=outputs.device), sequence_lengths]
    # ... ('avg', 'max', 'all' and index strategies omitted here)
    return outputs
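
If I am reading it correctly, I would apply it to the last hidden state roughly like this (my sketch, reusing the variables from the snippets above):

pooled = get_pooling(hidden_states[-1], inputs,
                     pooling_strategy='cls_avg',
                     padding_strategy=tokenizer.padding_side)
print(pooled.shape)   # [batch_size, embedding_size]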

  2. Why are the outputs being averaged with the attention_mask? The HF ‘outputs’ above only give the output up to the actual sequence length, so in that case can we just skip the attention_mask?

  3. Finally, given 2 above, what do current LLMs output as an embedding? For example, OpenAI has an embeddings API that returns the embedding of a text. What approach are they using to calculate this embedding?

What would you consider as the embedding of a string from an LLM?


A string input s passing through a Llama model is first tokenized by the Llama tokenizer into a tensor of token IDs with shape 1 x num_tokens. It is then embedded into latent space (1 x num_tokens x embedding_size) by the Embedding layer of the model. This embedding is also the first hidden state, which you can see with hidden_states[0]. The embedding then gets passed through 32 consecutive Llama layers, each consisting of self-attention, an MLP, and layer norms. The output of the i-th layer is hidden_states[i], which is why you see 33 hidden states in total.
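
You can check this correspondence directly; a small sketch, reusing the `model`, `inputs`, and `outputs` names from the question (`get_input_embeddings` is the standard transformers accessor for the embedding layer):

token_embeddings = model.get_input_embeddings()(inputs["input_ids"])  # [1, num_tokens, embedding_size]
print(torch.allclose(token_embeddings, outputs.hidden_states[0]))     # True: hidden_states[0] is the embedding output
print(model.config.num_hidden_layers)                                 # 32 decoder layers -> 33 hidden states in total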

As suggested by the code, to get a representation of the entire string, you can pool the hidden states of the individual tokens that make up the string. The outputs are averaged with the attention mask so that padding tokens, which are added when sequences in a batch have different lengths, do not contribute to the result. For a single, un-padded sequence the mask is all ones, so the masked average reduces to a plain mean over the tokens and you can indeed skip it.
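
For instance, a masked mean pooling over the last hidden state could look like this (a sketch, not any particular library's implementation; the names follow the snippets in the question):

import torch

def mean_pool(last_hidden_state, attention_mask):
    """Average the token states, ignoring padding positions."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)   # [batch, tokens, 1]
    summed = (last_hidden_state * mask).sum(dim=1)                    # [batch, embedding_size]
    counts = mask.sum(dim=1).clamp(min=1)                             # number of real tokens per sequence
    return summed / counts

sentence_embedding = mean_pool(hidden_states[-1], inputs["attention_mask"])
print(sentence_embedding.shape)   # [batch_size, embedding_size]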


Hi @vinven7

Embeddings from LLMs like Llama2 and Mistral are derived from multi-layered hidden states, typically shaped as [batch_size, number_of_tokens, embedding_size].

To obtain a single embedding per sequence, pooling strategies like averaging or max-pooling are applied, often using attention_mask to handle variable-length sequences.
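
For example, a masked max-pooling can be sketched as below (illustrative only; `hidden_states` and `inputs` are the variables from the original snippet):

import torch

def max_pool(last_hidden_state, attention_mask):
    """Element-wise max over tokens, ignoring padding positions."""
    mask = attention_mask.unsqueeze(-1).bool()                    # [batch, tokens, 1]
    masked = last_hidden_state.masked_fill(~mask, float("-inf"))  # padding can never win the max
    return masked.max(dim=1).values                               # [batch, embedding_size]

embedding = max_pool(hidden_states[-1], inputs["attention_mask"])
print(embedding.shape)                                            # [batch_size, embedding_size]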

Platforms like OpenAI use similar techniques to generate embeddings that summarize semantic and contextual information from text inputs.

Hope this helps 🙂


I wrote an article a while back which may help you to conceptualise an embedding. But in short, an embedding is just a representation of the input that contains semantic and contextual information.

What is an Embedding Anyway? | N E R | D S (medium.com)

To derive a single embedding from an LLM, you typically pool the hidden states using strategies like averaging the embeddings of all tokens, using the [CLS] token’s embedding, or other methods such as max pooling. The pooling approach often depends on the task and model design. Attention masks are used during pooling to avoid the influence of padding tokens, but may be less relevant for strategies like [CLS]. OpenAI’s embedding API likely uses a similar approach, combining hidden states with learned pooling techniques optimized for their specific use cases.
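
For decoder-only LLMs, another common choice (the 'last' strategy in the snippet quoted earlier) is the hidden state of the final non-padding token. A sketch, again reusing the names from the question's snippets:

import torch

def last_token_pool(last_hidden_state, attention_mask):
    """Take each sequence's final non-padding token (assumes right padding)."""
    last_idx = attention_mask.sum(dim=1) - 1                              # position of the last real token
    batch_idx = torch.arange(last_hidden_state.shape[0], device=last_hidden_state.device)
    return last_hidden_state[batch_idx, last_idx]                         # [batch_size, embedding_size]

# With left padding (tokenizer.padding_side == "left"), the final position is always a real
# token, so last_hidden_state[:, -1] can be used instead.
embedding = last_token_pool(hidden_states[-1], inputs["attention_mask"])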