How do Q, K, and V differ when they are created from the input in the encoder of the Transformer architecture? I read somewhere that these are 3 copies of the input sequence. If so, are Q, K, and V exactly the same as the input sequence or not? If not, how do they differ?
I had no idea, so I asked Hugging Chat.
HuggingChat's answer:
In the Transformer architecture, Query (Q), Key (K), and Value (V) are not exact copies of the input sequence. Instead, they are learned linear projections of the input, created by multiplying the input embeddings with different weight matrices (W_Q, W_K, W_V) that are learned during training. Here's how they differ and function:
Key Differences
- Learned Projections:
- The input sequence (e.g., token embeddings) is transformed into three distinct representations using separate weight matrices:
- Q: Q = X \cdot W_Q
- K: K = X \cdot W_K
- V: V = X \cdot W_V
- These weight matrices are initialized randomly and optimized during training to capture task-specific relationships.
- Purpose in Attention:
- Query (Q): Represents the current token's "query" for finding relevant tokens in the sequence.
- Key (K): Represents the "key" each token provides to match against queries.
- Value (V): Represents the actual information content of each token to be aggregated.
- Role in Attention Mechanism:
- Attention Scores: \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
- Q and K are used to compute similarity scores (e.g., dot products), which determine how much each token attends to others.
- V is the source of values weighted by these attention scores (a minimal code sketch follows this list).
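To make the separate projections concrete, here is a minimal single-head self-attention sketch in NumPy. It is not any library's actual implementation; the dimensions, the random weights, and the function name `scaled_dot_product_attention` are illustrative assumptions, with the random matrices standing in for parameters that would normally be learned during training.

```python
import numpy as np

def scaled_dot_product_attention(X, W_Q, W_K, W_V):
    """Single-head self-attention over one sequence X of shape (seq_len, d_model)."""
    Q = X @ W_Q                      # queries: (seq_len, d_k)
    K = X @ W_K                      # keys:    (seq_len, d_k)
    V = X @ W_V                      # values:  (seq_len, d_k)
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # similarity of every query with every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V               # attention-weighted sum of the values

# Illustrative sizes and random weights (assumptions, not from the original post).
seq_len, d_model, d_k = 3, 8, 4
rng = np.random.default_rng(0)
X = rng.normal(size=(seq_len, d_model))   # input embeddings
W_Q = rng.normal(size=(d_model, d_k))     # would be learned during training
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

out = scaled_dot_product_attention(X, W_Q, W_K, W_V)
print(out.shape)   # (3, 4): one contextualized vector per input token
```

Note that X enters the function once but is projected three different ways; only those projections, never the raw embeddings, take part in the score computation and the weighted sum.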
Why Separate Projections?
- Flexibility: Different linear projections allow the model to learn distinct representations for querying, key-matching, and value aggregation.
- Contextual Relationships: For example, a token's "query" representation might focus on its role as a subject, while its "key" representation highlights its role as an object.
Example
Suppose the input is a sequence of word embeddings X = [x_1, x_2, x_3]:
- Each token x_i is transformed into:
- q_i = x_i \cdot W_Q (query),
- k_i = x_i \cdot W_K (key),
- v_i = x_i \cdot W_V (value).
- Even though q_i, k_i, v_i originate from the same x_i, their projections via W_Q, W_K, W_V make them distinct (see the sketch below).
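A tiny NumPy sketch of this per-token view (again with assumed sizes and random, untrained weight matrices) shows that the three projections of the same x_i come out different:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_k = 8, 4

# One token embedding x_i and three separate (here random) projection matrices.
x_i = rng.normal(size=(d_model,))
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

q_i = x_i @ W_Q   # query view of the token
k_i = x_i @ W_K   # key view of the token
v_i = x_i @ W_V   # value view of the token

# Same source vector, three different projections -> three different vectors.
print(np.allclose(q_i, k_i), np.allclose(q_i, v_i))   # False False
```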
Summary
- Q, K, V are not exact copies: They are transformed versions of the input using task-specific learned weights.
- Differences arise from: Separate weight matrices that enable the attention mechanism to focus on different aspects of the input.
- Necessity: Without distinct projections, the attention mechanism would lose its ability to model complex relationships (e.g., dependencies between distant tokens).
This design allows the Transformer to dynamically adjust which parts of the input sequence are emphasized during processing, enabling powerful contextual understanding.
https://huggingface.co/chat/conversation/683688b4f285d484104070fd