What is an embedding?

I am confused about how to determine the best embedding for a given entity from an LLM. If I run:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, trust_remote_code=True, device_map="auto")
outputs = model(**inputs, output_hidden_states=True)  # `inputs` produced by the matching tokenizer
hidden_states = outputs.hidden_states                 # tuple: embedding output + one tensor per layer

For Llama2 and Mistral, there are 33 hidden states, each of shape [batch_size, number_of_tokens, embedding_size]. Ultimately, when I think of “an” embedding, I am expecting something with the shape [batch_size, embedding_size] (or just [1, embedding_size]). How do I do this conversion?
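
For completeness, `inputs` above comes from the matching tokenizer, roughly like this (a sketch; the text is just a placeholder), and these are the shapes I see:

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
inputs = tokenizer("some entity name", return_tensors="pt").to(model.device)

print(len(hidden_states))       # 33 for Llama2 / Mistral
print(hidden_states[-1].shape)  # [batch_size, number_of_tokens, embedding_size]
# what I actually want: a single tensor of shape [batch_size, embedding_size]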

  1. Averaging and Pooling

The function get_pooling from here seems to suggest that there could be several ways of doing this. Here is a code snippet:

def get_pooling(outputs, inputs, pooling_strategy, padding_strategy):
    """
    :param outputs:  torch.Tensor. Model outputs (without pooling)
    :param inputs:  Dict. Model inputs
    :param pooling_strategy:  str. Pooling strategy ['cls', 'cls_avg', 'cls_max', 'last', 'avg', 'max', 'all', index]
    :param padding_strategy:  str. Padding strategy of tokenizers (`left` or `right`).
        It can be obtained by `tokenizer.padding_side`.
    """
    if pooling_strategy == 'cls':
        # hidden state of the first token
        outputs = outputs[:, 0]
    elif pooling_strategy == 'cls_avg':
        # mean of the first token's state and a masked average over all tokens
        avg = torch.sum(
            outputs * inputs["attention_mask"][:, :, None], dim=1) / torch.sum(inputs["attention_mask"])
        outputs = (outputs[:, 0] + avg) / 2.0
    elif pooling_strategy == 'cls_max':
        # mean of the first token's state and a masked element-wise max over tokens
        maximum, _ = torch.max(outputs * inputs["attention_mask"][:, :, None], dim=1)
        outputs = (outputs[:, 0] + maximum) / 2.0
    elif pooling_strategy == 'last':
        # hidden state of the last non-padding token in each sequence
        batch_size = inputs['input_ids'].shape[0]
        sequence_lengths = -1 if padding_strategy == 'left' else inputs["attention_mask"].sum(dim=1) - 1
        outputs = outputs[torch.arange(batch_size, device=outputs.device), sequence_lengths]
    # ... ('avg', 'max', 'all' and index strategies omitted here)
    return outputs
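
If I am reading it correctly, I would apply it to the last hidden state roughly like this (my sketch, reusing the variables from the snippets above):

pooled = get_pooling(hidden_states[-1], inputs,
                     pooling_strategy='cls_avg',
                     padding_strategy=tokenizer.padding_side)
print(pooled.shape)   # [batch_size, embedding_size]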

  2. Why are the outputs being averaged with the attention_mask? The HF ‘outputs’ above only give the output up to the actual sequence length, so in that case can we just skip the attention_mask?

  3. Finally, given 2 above, what do current LLMs output as an embedding? For example, OpenAI has an embeddings API that returns the embedding of a text. What approach are they using to calculate this embedding?

What would you consider as the embedding of a string from an LLM?


A string input s passing through a Llama model is first tokenized by the Llama tokenizer into a tensor of token IDs with shape 1 x num_tokens. It is then embedded into latent space (1 x num_tokens x embedding_size) by the Embedding layer of the model. This embedding is also the first hidden state, which you can see with hidden_states[0]. The embedding then gets passed through 32 consecutive Llama layers, each consisting of self-attention, an MLP, and layer norms. The output of the i-th layer is hidden_states[i], which is why you see 33 hidden states in total.
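
You can check this correspondence directly; a small sketch, reusing the `model`, `inputs`, and `outputs` names from the question (`get_input_embeddings` is the standard transformers accessor for the embedding layer):

token_embeddings = model.get_input_embeddings()(inputs["input_ids"])  # [1, num_tokens, embedding_size]
print(torch.allclose(token_embeddings, outputs.hidden_states[0]))     # True: hidden_states[0] is the embedding output
print(model.config.num_hidden_layers)                                 # 32 decoder layers -> 33 hidden states in total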

As suggested by the code, to get a representation of the entire string, you can pool the hidden states of the individual tokens that make up the string. The outputs are averaged with the attention mask so that padding tokens, which are added when sequences in a batch have different lengths, do not contribute to the result. For a single, un-padded sequence the mask is all ones, so the masked average reduces to a plain mean over the tokens and you can indeed skip it.
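
For instance, a masked mean pooling over the last hidden state could look like this (a sketch, not any particular library's implementation; the names follow the snippets in the question):

import torch

def mean_pool(last_hidden_state, attention_mask):
    """Average the token states, ignoring padding positions."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)   # [batch, tokens, 1]
    summed = (last_hidden_state * mask).sum(dim=1)                    # [batch, embedding_size]
    counts = mask.sum(dim=1).clamp(min=1)                             # number of real tokens per sequence
    return summed / counts

sentence_embedding = mean_pool(hidden_states[-1], inputs["attention_mask"])
print(sentence_embedding.shape)   # [batch_size, embedding_size]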


Hi @vinven7

Embeddings from LLMs like Llama2 and Mistral are derived from multi-layered hidden states, typically shaped as [batch_size, number_of_tokens, embedding_size].

To obtain a single embedding per sequence, pooling strategies like averaging or max-pooling are applied, often using attention_mask to handle variable-length sequences.
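
For example, a masked max-pooling can be sketched as below (illustrative only; `hidden_states` and `inputs` are the variables from the original snippet):

import torch

def max_pool(last_hidden_state, attention_mask):
    """Element-wise max over tokens, ignoring padding positions."""
    mask = attention_mask.unsqueeze(-1).bool()                    # [batch, tokens, 1]
    masked = last_hidden_state.masked_fill(~mask, float("-inf"))  # padding can never win the max
    return masked.max(dim=1).values                               # [batch, embedding_size]

embedding = max_pool(hidden_states[-1], inputs["attention_mask"])
print(embedding.shape)                                            # [batch_size, embedding_size]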

Platforms like OpenAI use similar techniques to generate embeddings that summarize semantic and contextual information from text inputs.

Hope this helps 🙂


I wrote an article a while back which may help you to conceptualise an embedding. But in short, an embedding is just a representation of the input that contains semantic and contextual information.

What is an Embedding Anyway? | N E R | D S (medium.com)

To derive a single embedding from an LLM, you typically pool the hidden states using strategies like averaging the embeddings of all tokens, using the [CLS] token’s embedding, or other methods such as max pooling. The pooling approach often depends on the task and model design. Attention masks are used during pooling to avoid the influence of padding tokens, but may be less relevant for strategies like [CLS]. OpenAI’s embedding API likely uses a similar approach, combining hidden states with learned pooling techniques optimized for their specific use cases.
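
For decoder-only LLMs, another common choice (the 'last' strategy in the snippet quoted earlier) is the hidden state of the final non-padding token. A sketch, again reusing the names from the question's snippets:

import torch

def last_token_pool(last_hidden_state, attention_mask):
    """Take each sequence's final non-padding token (assumes right padding)."""
    last_idx = attention_mask.sum(dim=1) - 1                              # position of the last real token
    batch_idx = torch.arange(last_hidden_state.shape[0], device=last_hidden_state.device)
    return last_hidden_state[batch_idx, last_idx]                         # [batch_size, embedding_size]

# With left padding (tokenizer.padding_side == "left"), the final position is always a real
# token, so last_hidden_state[:, -1] can be used instead.
embedding = last_token_pool(hidden_states[-1], inputs["attention_mask"])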