Efficient batch inference using stacked past_key_values for multiple continuation candidates
Hi all,
I’m working on a task where, for each position in a sequence, I want to evaluate multiple possible token continuations. These continuations share the same prefix up to that point.
To speed things up, I compute the past_key_values incrementally for each position, and then, for all token candidates at that position, I reuse the same cache. I collect all candidate tokens into a batch and run a single forward pass using their individual input_ids and a stacked version of their corresponding past_key_values.
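To make the "reuse the same cache" part concrete: the core trick is just broadcasting one cache across a candidate batch. A minimal sketch, assuming the legacy tuple cache layout ((key, value) per layer) and a prefix computed with batch size 1 (broadcast_past and candidate_ids are illustrative names, not from my actual code):

import torch

def broadcast_past(past, n_candidates):
    # Each key/value has shape (1, n_heads, seq_len, head_dim); expanding the batch
    # dimension lets every candidate token attend to the same cached prefix.
    return tuple(
        (k.expand(n_candidates, -1, -1, -1), v.expand(n_candidates, -1, -1, -1))
        for k, v in past
    )

# candidate_ids has shape (n_candidates, 1): one next-token candidate per row.
# out = model(input_ids=candidate_ids,
#             past_key_values=broadcast_past(past, candidate_ids.size(0)),
#             use_cache=False)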
Here’s a simplified description of the approach:
For each position t in the sequence:
- Compute past_key_values up to that point.
- For each candidate token:
  - Create a one-token input_ids with that candidate.
  - Store the candidate's input_ids and the corresponding past_key_values.

Then:
- Stack all candidate inputs and their past_key_values into a batch.
- Run: model(input_ids=batch_input_ids, past_key_values=stacked_past, use_cache=False)
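The "compute past_key_values up to that point" step is just an incremental prefix pass; a simplified sketch of how I build the per-position caches (build_position_caches is an illustrative helper name):

import torch

def build_position_caches(model, prefix_ids, device):
    # Feed the prefix one token at a time so that after step t the cache covers tokens 0..t.
    past = None
    caches = []
    with torch.inference_mode():
        for tok in prefix_ids:  # prefix_ids: 1-D tensor of token ids
            step_ids = tok.view(1, 1).to(device)
            out = model(input_ids=step_ids, past_key_values=past, use_cache=True)
            past = out.past_key_values
            # Depending on the transformers version, `past` may be a DynamicCache that is
            # updated in place, so appending it here may not give an independent snapshot
            # per position; past.to_legacy_cache() (where available) is one way to copy it.
            caches.append(past)
    return caches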
My question is: Is this approach aligned with how past_key_values are expected to be used in batch inference? Are there potential pitfalls I should be aware of when batching multiple instances that have different past_key_values but share the same position in the context?
Any references to examples using similar logic would be appreciated!
Thanks in advance
import torch


def predict_each_candidate(cur_topk_idx, cur_ids, model, lm_head):
    # Evaluate multiple candidate tokens for each position in cur_ids, using
    # stacked past_key_values and a single batched forward pass.
    # `model` is the base (headless) model and `lm_head` its output projection;
    # `device` and `list_ids` (the set of allowed candidate token ids) are assumed
    # to be defined in the enclosing scope.
    past = None
    candidate_batches = []  # one-token input_ids per candidate
    candidate_pasts = []    # the cache each candidate should attend to
    candidate_map = []      # (position, option index) bookkeeping

    for t, (gt_tok, pos_tok_list) in enumerate(zip(cur_ids, cur_topk_idx)):
        # Extend the cache by one ground-truth token, so `past` now covers tokens 0..t.
        gt_tok_tensor = gt_tok.view(1, 1).to(device)
        with torch.inference_mode():
            out = model(input_ids=gt_tok_tensor, past_key_values=past, use_cache=True)
        past = out.past_key_values

        for option_i, tok in enumerate(pos_tok_list):
            if tok.item() in list_ids:
                input_ids = torch.tensor([[tok.item()]], device=device)
                candidate_batches.append(input_ids)
                candidate_pasts.append(past)
                candidate_map.append((t, option_i))

    if not candidate_batches:
        return None, candidate_map

    batch_input_ids = torch.cat(candidate_batches, dim=0)  # (n_candidates, 1)

    def stack_past_key_values(past_list):
        # Merge the per-candidate caches along the batch dimension. Each layer entry
        # is (key, value) of shape (1, n_heads, seq_len, head_dim), so concatenating
        # along dim=0 yields (n_candidates, n_heads, seq_len, head_dim). This assumes
        # every cache in past_list has the same seq_len; newer transformers versions
        # may also require wrapping the result with DynamicCache.from_legacy_cache.
        num_layers = len(past_list[0])
        stacked = []
        for layer_idx in range(num_layers):
            keys = torch.cat([p[layer_idx][0] for p in past_list], dim=0)
            values = torch.cat([p[layer_idx][1] for p in past_list], dim=0)
            stacked.append((keys, values))
        return tuple(stacked)

    stacked_past = stack_past_key_values(candidate_pasts)
    with torch.inference_mode():
        out = model(input_ids=batch_input_ids, past_key_values=stacked_past, use_cache=False)
    logits = lm_head(out.last_hidden_state)  # (n_candidates, 1, vocab_size)
    return logits, candidate_map
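For completeness, this is roughly how the function gets called; GPT-2 and the dummy cur_topk_idx / list_ids below are placeholders just to make the sketch self-contained:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").to(device).eval()
base_model, lm_head = lm.transformer, lm.lm_head  # GPT-2 specific attribute names

cur_ids = tokenizer("The quick brown fox", return_tensors="pt").input_ids[0].to(device)
# Dummy top-k candidates per position, shape (seq_len, k); in practice these come
# from my actual candidate generator.
cur_topk_idx = torch.randint(0, lm.config.vocab_size, (cur_ids.size(0), 5), device=device)
list_ids = set(range(lm.config.vocab_size))  # placeholder: allow every token id

logits, candidate_map = predict_each_candidate(cur_topk_idx, cur_ids, base_model, lm_head)
# logits: (n_candidates, 1, vocab_size); candidate_map[i] = (position, option index)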